Skip to main content

Module 4: File Systems & I/O

Primary text: Operating Systems: Three Easy Pieces (OSTEP), persistence chapters
Selective support: Operating System Concepts (Silberschatz) for the device / I/O subsystem and VFS framing, Unix Network Programming (Stevens) for the I/O models (select, poll, epoll), kernel.org and LWN for current Linux behavior

This guide is the primary teacher. You do not need to read OSTEP's persistence section front-to-back. You do need to come out of this module able to explain, from a syscall at the top to a sector write at the bottom, what the file system is doing on your behalf and when it is lying.


Scope of This Module

This module is not a survey of every file system. It is where the abstraction of "a file" becomes operational: you can name the pieces, you can trace reads and writes, and you can reason about what survives a crash.

What it covers in depth:

  • the file abstraction: byte streams, inodes, metadata, and what a "file" really is on disk
  • directories as a special kind of file and the name-to-inode translation
  • file descriptors, the open-file table, and reference-counted I/O handles
  • disks, sectors, LBAs, and the block interface the OS sees
  • the canonical FS layout: superblocks, inode tables, data blocks, bitmaps
  • FAT vs ext2/3/4 vs log-structured file systems as three layout philosophies
  • the crash consistency problem and why naive updates corrupt the FS
  • journaling with write-ahead logs and commit ordering
  • copy-on-write file systems (ZFS, Btrfs) and why they avoid in-place update
  • the page cache, the buffer cache, and what "cached" means across a reboot
  • read-ahead, write-back, and fsync semantics on Linux
  • sequential versus random I/O and why the gap is enormous
  • blocking, non-blocking, and readiness-based I/O with select / poll / epoll
  • asynchronous I/O: POSIX aio_* and io_uring
  • device I/O from the OS side: drivers, interrupts, and DMA

What it deliberately does not try to finish here:

  • distributed file systems (NFS, AFS) beyond a mention
  • database storage engines and page-level recovery
  • the full Linux VFS internals at source level
  • RAID and storage arrays beyond the context needed for crash consistency

This is the persistence half of the OS foundation. Module 5 (networking) assumes you are already comfortable with fds, readiness, and the event-loop shape introduced here.


Before You Start

Answer these closed-book before starting the main path:

  1. What is an inode, and how does it differ from a file name?
  2. If you call write(fd, buf, 4096) and the function returns, is the data on disk?
  3. Why does select become slow when you are watching 50,000 sockets?
  4. What is the difference between fsync(fd) and sync()?
  5. Why does doubling a file's size in FAT require a chain of pointer updates?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path. Expect Cluster 3 (crash consistency) to still be new.

2-3 solid answers

  • Continue, but expect extra time in Cluster 2 (on-disk layout) and Cluster 5 (I/O models).

0-1 solid answers

  • Revisit Semester 4 Module 3 (systems programming, read/write syscalls) before starting. Those modules built the syscall habits this one assumes.

What This Module Is For

Every durable system in the rest of the degree assumes storage that "just works." This module is where that assumption stops being magical. Throughout later work you will be asked:

  • if the database crashes mid-commit, what actually survives?
  • why is fsync expensive, and when can I skip it?
  • given 10,000 concurrent connections, how do I avoid one thread per socket?
  • why does my log-append workload beat my random-update workload by 100x?
  • when does io_uring win over epoll + non-blocking I/O?

This module builds the operational reasoning needed for:

  • database storage engines and WAL-based recovery
  • event-driven servers and high-concurrency I/O
  • container and cloud storage primitives
  • observability on I/O paths (iostat, strace, blktrace, eBPF)

You are learning to think about persistence and I/O without handwaving.


Concept Map


How To Use This Module

Work in order. Cluster 1 fixes vocabulary. Cluster 2 fixes layout. Cluster 3 fixes correctness under failure. Cluster 4 fixes performance. Cluster 5 fixes the syscall path.

Cluster 1: The File Abstraction

OrderConceptTypeFocus
1What a File Is: Byte Stream, Metadata, InodesPRIMARYThe split between name, inode, and data blocks
2Directories Are Just Files With Structured ContentsPRIMARYName-to-inode tables, traversal, and hard/symlinks
3File Descriptors, Open-File Tables, Reference CountingPRIMARYPer-process, system-wide, and inode-level tables

Cluster mastery check: Given open("/a/b/c.txt", O_RDWR), can you name every kernel data structure the call touches and what it writes into each?

Cluster 2: On-Disk Layout and Structure

OrderConceptTypeFocus
4Disks, Sectors, LBAs, and the Block InterfacePRIMARYWhat the OS sees beneath the file system
5Superblocks, Inodes, Data Blocks: Typical FS LayoutPRIMARYThe canonical on-disk structures and how they fit
6FAT vs ext2/3/4 vs Log-Structured File SystemsSUPPORTINGThree layout philosophies compared

Cluster mastery check: Draw an ext-style on-disk layout with block addresses, and trace read(fd, ...) through it to the final LBA.

Cluster 3: Crash Consistency

OrderConceptTypeFocus
7The Crash Consistency ProblemPRIMARYWhy simple updates leave the FS in partial states
8Journaling: Write-Ahead Logs and Commit OrderingPRIMARYData vs metadata journaling and commit discipline
9Copy-on-Write File Systems: ZFS and BtrfsSUPPORTINGAtomic root-pointer switches as an alternative

Cluster mastery check: For an append that spans three on-disk writes, trace what happens if the machine loses power between each pair, under both ext4 journaled and naive.

Cluster 4: Caching and Performance

OrderConceptTypeFocus
10Page Cache and Buffer CachePRIMARYUnified caching and what "cached" really means
11Read-Ahead, Write-Back, fsync SemanticsPRIMARYWhen writes actually reach the platter
12Sequential vs Random I/O PerformanceSUPPORTINGThe hundred-fold gap and why it shapes designs

Cluster mastery check: Explain the full life of a write that is dirtied, cached, written back 30 seconds later, and hit by a SIGKILL at minute 5.

Cluster 5: I/O Models and the Syscall Path

OrderConceptTypeFocus
13Blocking, Non-blocking, and select / poll / epollPRIMARYReadiness-based I/O and its scaling characteristics
14Asynchronous I/O: aio_* and io_uringPRIMARYTrue completion-based I/O on Linux
15Device I/O: Drivers, Interrupts, DMASUPPORTINGWhat the OS does to move bytes to hardware

Cluster mastery check: Explain why epoll + non-blocking I/O scales beyond select, and why io_uring can still beat both under write-heavy workloads.

Then work these practice pages:

OrderPractice pathFocus
1File System Internals WorkshopInodes, directories, open-file tables, layout tracing
2Crash Consistency ClinicRecovery scenarios with and without journaling
3I/O Performance LabPage cache, fsync cost, sequential vs random benchmarks
4Code Katas: FUSE, fsync, epoll, RecoveryImplement a small FS, an epoll loop, measure and demonstrate

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Draw the kernel data structures involved in open, read, write, close, and explain which structures are per-process, per-open-file, and per-inode.
  2. Trace a byte offset in a user-space read down to a specific LBA on disk, through the VFS, file-system-specific code, page cache, block layer, and driver.
  3. Explain the canonical on-disk layout of an ext-style file system, including superblock, inode table, data bitmap, and inode bitmap.
  4. Contrast FAT-style, ext-style, and log-structured file systems in terms of allocation, metadata, and update cost.
  5. State the crash consistency problem and show why naive in-place updates leave the FS in an inconsistent state.
  6. Explain journaling (data and metadata modes) and how write-ahead logs with commit ordering restore consistency.
  7. Explain copy-on-write file systems and why they achieve atomic updates without a separate journal.
  8. Distinguish the page cache from the buffer cache, explain write-back and read-ahead, and state exactly what fsync guarantees.
  9. Quantify the performance gap between sequential and random I/O on HDD and SSD, and justify a given design's I/O pattern.
  10. Select between blocking I/O, select/poll, epoll, and io_uring for a given workload, and defend the choice against scaling and throughput constraints.

Outputs

  • a filesystem notebook with at least 10 traced syscalls annotated down to the on-disk layer
  • a crash recovery log: for each of at least 6 scenarios, record what the FS looks like post-crash with and without journaling
  • a small FUSE filesystem that implements open, read, write, readdir for a backing directory or memory map
  • a measured fsync cost report: throughput and latency for appends with and without fsync, on both SSD and (if available) HDD
  • an epoll-based echo server that handles at least 1,000 concurrent connections
  • an I/O benchmark report comparing sequential and random reads at 4 KiB, 64 KiB, and 1 MiB block sizes
  • a mistake journal of I/O-path confusions you want your future self not to repeat
  • a short write-up relating Module 4 persistence habits to the networking and distributed-systems modules ahead

Completion Standard

You have completed Module 4 when all of these are true:

  • you can describe what open("/a/b.txt") touches at every layer without looking anything up
  • you can explain when write returning does and does not mean "on disk"
  • you can reason about crash outcomes for append, rename, and truncate operations
  • you can write a minimal event loop with epoll and explain why select is the wrong tool at scale
  • you have measured, yourself, the cost of fsync on at least one real device
  • you have at least one real debugging narrative from running strace, iostat, or blktrace

If you can edit a file but cannot explain which structures on disk just changed, the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • Local book chunks are selective reinforcement, not a second syllabus.
  • Read only if stuck means try the concept page, self-check, and drill first.
  • Optional deep dive means additional nuance or production-grade detail, not required progression.
  • External links (kernel.org, LWN, unixism.net, Brendan Gregg) are validated and stable; use them for current Linux behavior that textbooks predate.

Suggested Weekly Flow

DayWork
1Concepts 1-3, draw open-file table for three processes sharing a file
2Concepts 4-5, trace read through VFS to LBA on paper
3Concept 6 and Practice 1
4Concepts 7-9, draw crash scenarios before and after each write
5Practice 2 (crash recovery) and fsync measurement
6Concepts 10-12 and Practice 3 (I/O performance)
7Concepts 13-15
8Practice 4 kata 1 (FUSE filesystem)
9Practice 4 katas 2-4 (fsync cost, epoll loop, recovery demo)
10Quiz, mistake-log cleanup, cross-module write-up

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Build Your Own X — elective

The Git tutorial is a content-addressable filesystem on top of a normal filesystem — the cleanest possible example of the abstractions this module covers. See Build Your Own X overview.

Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread