Module 4: File Systems & I/O
Primary text: Operating Systems: Three Easy Pieces (OSTEP), persistence chapters
Selective support: Operating System Concepts (Silberschatz) for the device / I/O subsystem and VFS framing, Unix Network Programming (Stevens) for the I/O models (select, poll, epoll), kernel.org and LWN for current Linux behavior
This guide is the primary teacher. You do not need to read OSTEP's persistence section front-to-back. You do need to come out of this module able to explain, from a syscall at the top to a sector write at the bottom, what the file system is doing on your behalf and when it is lying.
Scope of This Module
This module is not a survey of every file system. It is where the abstraction of "a file" becomes operational: you can name the pieces, you can trace reads and writes, and you can reason about what survives a crash.
What it covers in depth:
- the file abstraction: byte streams, inodes, metadata, and what a "file" really is on disk
- directories as a special kind of file and the name-to-inode translation
- file descriptors, the open-file table, and reference-counted I/O handles
- disks, sectors, LBAs, and the block interface the OS sees
- the canonical FS layout: superblocks, inode tables, data blocks, bitmaps
- FAT vs ext2/3/4 vs log-structured file systems as three layout philosophies
- the crash consistency problem and why naive updates corrupt the FS
- journaling with write-ahead logs and commit ordering
- copy-on-write file systems (ZFS, Btrfs) and why they avoid in-place update
- the page cache, the buffer cache, and what "cached" means across a reboot
- read-ahead, write-back, and
fsyncsemantics on Linux - sequential versus random I/O and why the gap is enormous
- blocking, non-blocking, and readiness-based I/O with
select/poll/epoll - asynchronous I/O: POSIX
aio_*andio_uring - device I/O from the OS side: drivers, interrupts, and DMA
What it deliberately does not try to finish here:
- distributed file systems (NFS, AFS) beyond a mention
- database storage engines and page-level recovery
- the full Linux VFS internals at source level
- RAID and storage arrays beyond the context needed for crash consistency
This is the persistence half of the OS foundation. Module 5 (networking) assumes you are already comfortable with fds, readiness, and the event-loop shape introduced here.
Before You Start
Answer these closed-book before starting the main path:
- What is an inode, and how does it differ from a file name?
- If you call
write(fd, buf, 4096)and the function returns, is the data on disk? - Why does
selectbecome slow when you are watching 50,000 sockets? - What is the difference between
fsync(fd)andsync()? - Why does doubling a file's size in FAT require a chain of pointer updates?
Diagnostic Interpretation
4-5 solid answers
- You are ready for the full path. Expect Cluster 3 (crash consistency) to still be new.
2-3 solid answers
- Continue, but expect extra time in Cluster 2 (on-disk layout) and Cluster 5 (I/O models).
0-1 solid answers
- Revisit Semester 4 Module 3 (systems programming,
read/writesyscalls) before starting. Those modules built the syscall habits this one assumes.
What This Module Is For
Every durable system in the rest of the degree assumes storage that "just works." This module is where that assumption stops being magical. Throughout later work you will be asked:
- if the database crashes mid-commit, what actually survives?
- why is
fsyncexpensive, and when can I skip it? - given 10,000 concurrent connections, how do I avoid one thread per socket?
- why does my log-append workload beat my random-update workload by 100x?
- when does
io_uringwin overepoll+ non-blocking I/O?
This module builds the operational reasoning needed for:
- database storage engines and WAL-based recovery
- event-driven servers and high-concurrency I/O
- container and cloud storage primitives
- observability on I/O paths (iostat,
strace,blktrace, eBPF)
You are learning to think about persistence and I/O without handwaving.
Concept Map
How To Use This Module
Work in order. Cluster 1 fixes vocabulary. Cluster 2 fixes layout. Cluster 3 fixes correctness under failure. Cluster 4 fixes performance. Cluster 5 fixes the syscall path.
Cluster 1: The File Abstraction
| Order | Concept | Type | Focus |
|---|---|---|---|
| 1 | What a File Is: Byte Stream, Metadata, Inodes | PRIMARY | The split between name, inode, and data blocks |
| 2 | Directories Are Just Files With Structured Contents | PRIMARY | Name-to-inode tables, traversal, and hard/symlinks |
| 3 | File Descriptors, Open-File Tables, Reference Counting | PRIMARY | Per-process, system-wide, and inode-level tables |
Cluster mastery check: Given open("/a/b/c.txt", O_RDWR), can you name every kernel data structure the call touches and what it writes into each?
Cluster 2: On-Disk Layout and Structure
| Order | Concept | Type | Focus |
|---|---|---|---|
| 4 | Disks, Sectors, LBAs, and the Block Interface | PRIMARY | What the OS sees beneath the file system |
| 5 | Superblocks, Inodes, Data Blocks: Typical FS Layout | PRIMARY | The canonical on-disk structures and how they fit |
| 6 | FAT vs ext2/3/4 vs Log-Structured File Systems | SUPPORTING | Three layout philosophies compared |
Cluster mastery check: Draw an ext-style on-disk layout with block addresses, and trace read(fd, ...) through it to the final LBA.
Cluster 3: Crash Consistency
| Order | Concept | Type | Focus |
|---|---|---|---|
| 7 | The Crash Consistency Problem | PRIMARY | Why simple updates leave the FS in partial states |
| 8 | Journaling: Write-Ahead Logs and Commit Ordering | PRIMARY | Data vs metadata journaling and commit discipline |
| 9 | Copy-on-Write File Systems: ZFS and Btrfs | SUPPORTING | Atomic root-pointer switches as an alternative |
Cluster mastery check: For an append that spans three on-disk writes, trace what happens if the machine loses power between each pair, under both ext4 journaled and naive.
Cluster 4: Caching and Performance
| Order | Concept | Type | Focus |
|---|---|---|---|
| 10 | Page Cache and Buffer Cache | PRIMARY | Unified caching and what "cached" really means |
| 11 | Read-Ahead, Write-Back, fsync Semantics | PRIMARY | When writes actually reach the platter |
| 12 | Sequential vs Random I/O Performance | SUPPORTING | The hundred-fold gap and why it shapes designs |
Cluster mastery check: Explain the full life of a write that is dirtied, cached, written back 30 seconds later, and hit by a SIGKILL at minute 5.
Cluster 5: I/O Models and the Syscall Path
| Order | Concept | Type | Focus |
|---|---|---|---|
| 13 | Blocking, Non-blocking, and select / poll / epoll | PRIMARY | Readiness-based I/O and its scaling characteristics |
| 14 | Asynchronous I/O: aio_* and io_uring | PRIMARY | True completion-based I/O on Linux |
| 15 | Device I/O: Drivers, Interrupts, DMA | SUPPORTING | What the OS does to move bytes to hardware |
Cluster mastery check: Explain why epoll + non-blocking I/O scales beyond select, and why io_uring can still beat both under write-heavy workloads.
Then work these practice pages:
| Order | Practice path | Focus |
|---|---|---|
| 1 | File System Internals Workshop | Inodes, directories, open-file tables, layout tracing |
| 2 | Crash Consistency Clinic | Recovery scenarios with and without journaling |
| 3 | I/O Performance Lab | Page cache, fsync cost, sequential vs random benchmarks |
| 4 | Code Katas: FUSE, fsync, epoll, Recovery | Implement a small FS, an epoll loop, measure and demonstrate |
Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.
Learning Objectives
By the end of this module you should be able to:
- Draw the kernel data structures involved in
open,read,write,close, and explain which structures are per-process, per-open-file, and per-inode. - Trace a byte offset in a user-space
readdown to a specific LBA on disk, through the VFS, file-system-specific code, page cache, block layer, and driver. - Explain the canonical on-disk layout of an ext-style file system, including superblock, inode table, data bitmap, and inode bitmap.
- Contrast FAT-style, ext-style, and log-structured file systems in terms of allocation, metadata, and update cost.
- State the crash consistency problem and show why naive in-place updates leave the FS in an inconsistent state.
- Explain journaling (data and metadata modes) and how write-ahead logs with commit ordering restore consistency.
- Explain copy-on-write file systems and why they achieve atomic updates without a separate journal.
- Distinguish the page cache from the buffer cache, explain write-back and read-ahead, and state exactly what
fsyncguarantees. - Quantify the performance gap between sequential and random I/O on HDD and SSD, and justify a given design's I/O pattern.
- Select between blocking I/O,
select/poll,epoll, andio_uringfor a given workload, and defend the choice against scaling and throughput constraints.
Outputs
- a filesystem notebook with at least 10 traced syscalls annotated down to the on-disk layer
- a crash recovery log: for each of at least 6 scenarios, record what the FS looks like post-crash with and without journaling
- a small FUSE filesystem that implements
open,read,write,readdirfor a backing directory or memory map - a measured
fsynccost report: throughput and latency for appends with and withoutfsync, on both SSD and (if available) HDD - an
epoll-based echo server that handles at least 1,000 concurrent connections - an I/O benchmark report comparing sequential and random reads at 4 KiB, 64 KiB, and 1 MiB block sizes
- a mistake journal of I/O-path confusions you want your future self not to repeat
- a short write-up relating Module 4 persistence habits to the networking and distributed-systems modules ahead
Completion Standard
You have completed Module 4 when all of these are true:
- you can describe what
open("/a/b.txt")touches at every layer without looking anything up - you can explain when
writereturning does and does not mean "on disk" - you can reason about crash outcomes for append, rename, and truncate operations
- you can write a minimal event loop with
epolland explain whyselectis the wrong tool at scale - you have measured, yourself, the cost of
fsyncon at least one real device - you have at least one real debugging narrative from running
strace,iostat, orblktrace
If you can edit a file but cannot explain which structures on disk just changed, the module is not complete.
Reading Policy
- Concept pages are the main path.
- Local book chunks are selective reinforcement, not a second syllabus.
Read only if stuckmeans try the concept page, self-check, and drill first.Optional deep divemeans additional nuance or production-grade detail, not required progression.- External links (kernel.org, LWN, unixism.net, Brendan Gregg) are validated and stable; use them for current Linux behavior that textbooks predate.
Suggested Weekly Flow
| Day | Work |
|---|---|
| 1 | Concepts 1-3, draw open-file table for three processes sharing a file |
| 2 | Concepts 4-5, trace read through VFS to LBA on paper |
| 3 | Concept 6 and Practice 1 |
| 4 | Concepts 7-9, draw crash scenarios before and after each write |
| 5 | Practice 2 (crash recovery) and fsync measurement |
| 6 | Concepts 10-12 and Practice 3 (I/O performance) |
| 7 | Concepts 13-15 |
| 8 | Practice 4 kata 1 (FUSE filesystem) |
| 9 | Practice 4 katas 2-4 (fsync cost, epoll loop, recovery demo) |
| 10 | Quiz, mistake-log cleanup, cross-module write-up |
Reference
If you need exact links into the local chunked books, use Reference and Selective Reading.
The Git tutorial is a content-addressable filesystem on top of a normal filesystem — the cleanest possible example of the abstractions this module covers. See Build Your Own X overview.
Rich Learning Pages
Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread