Module 4: File Systems & I/O

Primary text: Operating Systems: Three Easy Pieces (OSTEP), persistence chapters
Selective support: Operating System Concepts (Silberschatz) for the device / I/O subsystem and VFS framing, Unix Network Programming (Stevens) for the I/O models (select, poll, epoll), kernel.org and LWN for current Linux behavior

This guide is the primary teacher. You do not need to read OSTEP's persistence section front-to-back. You do need to come out of this module able to explain, from a syscall at the top to a sector write at the bottom, what the file system is doing on your behalf and when it is lying.

Scope of This Module

This module is not a survey of every file system. It is where the abstraction of "a file" becomes operational: you can name the pieces, you can trace reads and writes, and you can reason about what survives a crash.

What it covers in depth:

the file abstraction: byte streams, inodes, metadata, and what a "file" really is on disk
directories as a special kind of file and the name-to-inode translation
file descriptors, the open-file table, and reference-counted I/O handles
disks, sectors, LBAs, and the block interface the OS sees
the canonical FS layout: superblocks, inode tables, data blocks, bitmaps
FAT vs ext2/3/4 vs log-structured file systems as three layout philosophies
the crash consistency problem and why naive updates corrupt the FS
journaling with write-ahead logs and commit ordering
copy-on-write file systems (ZFS, Btrfs) and why they avoid in-place update
the page cache, the buffer cache, and what "cached" means across a reboot
read-ahead, write-back, and fsync semantics on Linux
sequential versus random I/O and why the gap is enormous
blocking, non-blocking, and readiness-based I/O with select / poll / epoll
asynchronous I/O: POSIX aio_* and io_uring
device I/O from the OS side: drivers, interrupts, and DMA

What it deliberately does not try to finish here:

distributed file systems (NFS, AFS) beyond a mention
database storage engines and page-level recovery
the full Linux VFS internals at source level
RAID and storage arrays beyond the context needed for crash consistency

This is the persistence half of the OS foundation. Module 5 (networking) assumes you are already comfortable with fds, readiness, and the event-loop shape introduced here.

Before You Start

Answer these closed-book before starting the main path:

What is an inode, and how does it differ from a file name?
If you call write(fd, buf, 4096) and the function returns, is the data on disk?
Why does select become slow when you are watching 50,000 sockets?
What is the difference between fsync(fd) and sync()?
Why does doubling a file's size in FAT require a chain of pointer updates?

Diagnostic Interpretation

4-5 solid answers

You are ready for the full path. Expect Cluster 3 (crash consistency) to still be new.

2-3 solid answers

Continue, but expect extra time in Cluster 2 (on-disk layout) and Cluster 5 (I/O models).

0-1 solid answers

Revisit Semester 4 Module 3 (systems programming, read/write syscalls) before starting. Those modules built the syscall habits this one assumes.

What This Module Is For

Every durable system in the rest of the degree assumes storage that "just works." This module is where that assumption stops being magical. Throughout later work you will be asked:

if the database crashes mid-commit, what actually survives?
why is fsync expensive, and when can I skip it?
given 10,000 concurrent connections, how do I avoid one thread per socket?
why does my log-append workload beat my random-update workload by 100x?
when does io_uring win over epoll + non-blocking I/O?

This module builds the operational reasoning needed for:

database storage engines and WAL-based recovery
event-driven servers and high-concurrency I/O
container and cloud storage primitives
observability on I/O paths (iostat, strace, blktrace, eBPF)

You are learning to think about persistence and I/O without handwaving.

Concept Map

How To Use This Module

Work in order. Cluster 1 fixes vocabulary. Cluster 2 fixes layout. Cluster 3 fixes correctness under failure. Cluster 4 fixes performance. Cluster 5 fixes the syscall path.

Cluster 1: The File Abstraction

Order	Concept	Type	Focus
1	What a File Is: Byte Stream, Metadata, Inodes	PRIMARY	The split between name, inode, and data blocks
2	Directories Are Just Files With Structured Contents	PRIMARY	Name-to-inode tables, traversal, and hard/symlinks
3	File Descriptors, Open-File Tables, Reference Counting	PRIMARY	Per-process, system-wide, and inode-level tables

Cluster mastery check: Given open("/a/b/c.txt", O_RDWR), can you name every kernel data structure the call touches and what it writes into each?

Cluster 2: On-Disk Layout and Structure

Order	Concept	Type	Focus
4	Disks, Sectors, LBAs, and the Block Interface	PRIMARY	What the OS sees beneath the file system
5	Superblocks, Inodes, Data Blocks: Typical FS Layout	PRIMARY	The canonical on-disk structures and how they fit
6	FAT vs ext2/3/4 vs Log-Structured File Systems	SUPPORTING	Three layout philosophies compared

Cluster mastery check: Draw an ext-style on-disk layout with block addresses, and trace read(fd, ...) through it to the final LBA.

Cluster 3: Crash Consistency

Order	Concept	Type	Focus
7	The Crash Consistency Problem	PRIMARY	Why simple updates leave the FS in partial states
8	Journaling: Write-Ahead Logs and Commit Ordering	PRIMARY	Data vs metadata journaling and commit discipline
9	Copy-on-Write File Systems: ZFS and Btrfs	SUPPORTING	Atomic root-pointer switches as an alternative

Cluster mastery check: For an append that spans three on-disk writes, trace what happens if the machine loses power between each pair, under both ext4 journaled and naive.

Cluster 4: Caching and Performance

Order	Concept	Type	Focus
10	Page Cache and Buffer Cache	PRIMARY	Unified caching and what "cached" really means
11	Read-Ahead, Write-Back, `fsync` Semantics	PRIMARY	When writes actually reach the platter
12	Sequential vs Random I/O Performance	SUPPORTING	The hundred-fold gap and why it shapes designs

Cluster mastery check: Explain the full life of a write that is dirtied, cached, written back 30 seconds later, and hit by a SIGKILL at minute 5.

Cluster 5: I/O Models and the Syscall Path

Order	Concept	Type	Focus
13	Blocking, Non-blocking, and `select` / `poll` / `epoll`	PRIMARY	Readiness-based I/O and its scaling characteristics
14	Asynchronous I/O: `aio_*` and `io_uring`	PRIMARY	True completion-based I/O on Linux
15	Device I/O: Drivers, Interrupts, DMA	SUPPORTING	What the OS does to move bytes to hardware

Cluster mastery check: Explain why epoll + non-blocking I/O scales beyond select, and why io_uring can still beat both under write-heavy workloads.

Then work these practice pages:

Order	Practice path	Focus
1	File System Internals Workshop	Inodes, directories, open-file tables, layout tracing
2	Crash Consistency Clinic	Recovery scenarios with and without journaling
3	I/O Performance Lab	Page cache, `fsync` cost, sequential vs random benchmarks
4	Code Katas: FUSE, fsync, epoll, Recovery	Implement a small FS, an epoll loop, measure and demonstrate

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.

Learning Objectives

By the end of this module you should be able to:

Draw the kernel data structures involved in open, read, write, close, and explain which structures are per-process, per-open-file, and per-inode.
Trace a byte offset in a user-space read down to a specific LBA on disk, through the VFS, file-system-specific code, page cache, block layer, and driver.
Explain the canonical on-disk layout of an ext-style file system, including superblock, inode table, data bitmap, and inode bitmap.
Contrast FAT-style, ext-style, and log-structured file systems in terms of allocation, metadata, and update cost.
State the crash consistency problem and show why naive in-place updates leave the FS in an inconsistent state.
Explain journaling (data and metadata modes) and how write-ahead logs with commit ordering restore consistency.
Explain copy-on-write file systems and why they achieve atomic updates without a separate journal.
Distinguish the page cache from the buffer cache, explain write-back and read-ahead, and state exactly what fsync guarantees.
Quantify the performance gap between sequential and random I/O on HDD and SSD, and justify a given design's I/O pattern.
Select between blocking I/O, select/poll, epoll, and io_uring for a given workload, and defend the choice against scaling and throughput constraints.

Outputs

a filesystem notebook with at least 10 traced syscalls annotated down to the on-disk layer
a crash recovery log: for each of at least 6 scenarios, record what the FS looks like post-crash with and without journaling
a small FUSE filesystem that implements open, read, write, readdir for a backing directory or memory map
a measured fsync cost report: throughput and latency for appends with and without fsync, on both SSD and (if available) HDD
an epoll-based echo server that handles at least 1,000 concurrent connections
an I/O benchmark report comparing sequential and random reads at 4 KiB, 64 KiB, and 1 MiB block sizes
a mistake journal of I/O-path confusions you want your future self not to repeat
a short write-up relating Module 4 persistence habits to the networking and distributed-systems modules ahead

Completion Standard

You have completed Module 4 when all of these are true:

you can describe what open("/a/b.txt") touches at every layer without looking anything up
you can explain when write returning does and does not mean "on disk"
you can reason about crash outcomes for append, rename, and truncate operations
you can write a minimal event loop with epoll and explain why select is the wrong tool at scale
you have measured, yourself, the cost of fsync on at least one real device
you have at least one real debugging narrative from running strace, iostat, or blktrace

If you can edit a file but cannot explain which structures on disk just changed, the module is not complete.

Reading Policy

Concept pages are the main path.
Local book chunks are selective reinforcement, not a second syllabus.
Read only if stuck means try the concept page, self-check, and drill first.
Optional deep dive means additional nuance or production-grade detail, not required progression.
External links (kernel.org, LWN, unixism.net, Brendan Gregg) are validated and stable; use them for current Linux behavior that textbooks predate.

Suggested Weekly Flow

Day	Work
1	Concepts 1-3, draw open-file table for three processes sharing a file
2	Concepts 4-5, trace `read` through VFS to LBA on paper
3	Concept 6 and Practice 1
4	Concepts 7-9, draw crash scenarios before and after each write
5	Practice 2 (crash recovery) and `fsync` measurement
6	Concepts 10-12 and Practice 3 (I/O performance)
7	Concepts 13-15
8	Practice 4 kata 1 (FUSE filesystem)
9	Practice 4 katas 2-4 (fsync cost, epoll loop, recovery demo)
10	Quiz, mistake-log cleanup, cross-module write-up

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.

Build Your Own X — elective

The Git tutorial is a content-addressable filesystem on top of a normal filesystem — the cleanest possible example of the abstractions this module covers. See Build Your Own X overview.

Rich Learning Pages

Scope of This Module​

Before You Start​

Diagnostic Interpretation​

What This Module Is For​

Concept Map​

How To Use This Module​

Cluster 1: The File Abstraction​

Cluster 2: On-Disk Layout and Structure​

Cluster 3: Crash Consistency​

Cluster 4: Caching and Performance​

Cluster 5: I/O Models and the Syscall Path​

Learning Objectives​

Outputs​

Completion Standard​

Reading Policy​

Suggested Weekly Flow​

Reference​

Rich Learning Pages​