Chapter 42: A Detailed Example
This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.
Learning objectives
- Explain the main ideas and vocabulary in A Detailed Example.
- Work through the source examples for A Detailed Example without depending on raw chunk order.
- Use A Detailed Example as selective reference when learner modules point back to Ostep.
Prerequisites
- Earlier prerequisite concepts leading into Chapter 42: A Detailed Example.
Module targets
module-04-file-systems-io
AI companion modes
- Explain simply
- Socratic tutor
- Quiz me
- Challenge my understanding
- Diagnose my confusion
- Generate extra practice
- Revision mode
- Connect forward / backward
Source-of-truth note
This unit is anchored to Ostep and the source chapter "Chapter 42: A Detailed Example". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.
External enrichment
No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.
Source provenance
- Primary source:
Ostep - Source chapter 42: Chapter 42: A Detailed Example
- Raw source file:
209-42-1-a-detailed-example.md - Raw source file:
210-42-2-solution-1-the-file-system-checker.md - Raw source file:
215-42-4-solution-3-other-approaches.md
Merged source
A Detailed Example
42.1 A Detailed Example
42 Crash Consistency: FSCK and Journaling
As we've seen thus far, the file system manages a set of data structures to implement the expected abstractions: files, directories, and all of the other metadata needed to support the basic abstraction that we expect from a file system. Unlike most data structures (for example, those found in memory of a running program), file system data structures mustpersist, i.e., they must survive over the long haul, stored on devices that retain data despite power loss (such as hard disks or flash-based SSDs).
One major challenge faced by a file system is how to update persistent data structures despite the presence of apower lossorsystem crash.
Specifically, what happens if, right in the middle of updating on-disk structures, someone trips over the power cord and the machine loses power? Or the operating system encounters a bug and crashes? Because of power losses and crashes, updating a persistent data structure can be quite tricky, and leads to a new and interesting problem in file system implementation, known as thecrash-consistency problem.
This problem is quite simple to understand. Imagine you have to update two on-disk structures,AandB, in order to complete a particular operation. Because the disk only services a single request at a time, one of these requests will reach the disk first (either AorB). If the system crashes or loses power after one write completes, the on-disk structure will be left in aninconsistentstate. And thus, we have a problem that all file systems need to solve:
THECRUX: HOWTOUPDATETHEDISKDESPITECRASHES
The system may crash or lose power between any two writes, and thus the on-disk state may only partially get updated. After the crash, the system boots and wishes to mount the file system again (in order to access files and such). Given that crashes can occur at arbitrary points in time, how do we ensure the file system keeps the on-disk image in a reasonable state?
1
In this chapter, we'll describe this problem in more detail, and look at some methods file systems have used to overcome it. We'll begin by examining the approach taken by older file systems, known asfsckor the file system checker. We'll then turn our attention to another approach, known asjournaling(also known aswrite-ahead logging), a technique which adds a little bit of overhead to each write but recovers more quickly from crashes or power losses. We will discuss the basic machinery of journaling, including a few different flavors of journaling that Linux ext3
[T98,PAA05] (a relatively modern journaling file system) implements.
To kick off our investigation of journaling, let's look at an example.
We'll need to use a workloadthat updates on-disk structures in some way. Assume here that the workload is simple: the append of a single data block to an existing file. The append is accomplished by opening the file, callinglseek()to move the file offset to the end of the file, and then issuing a single 4KB write to the file before closing it.
Let's also assume we are using standard simple file system structures on the disk, similar to file systems we have seen before. This tiny example includes aninode bitmap(with just 8 bits, one per inode), adata bitmap (also 8 bits, one per data block), inodes (8 total, numbered 0 to 7, and spread across four blocks), and data blocks (8 total, numbered 0 to 7).
Here is a diagram of this file system:
Inode Data
Inodes Data Blocks
Bmap Bmap
I[v1]
Da
If you look at the structures in the picture, you can see that a single inode is allocated (inode number 2), which is marked in the inode bitmap, and a single allocated data block (data block 4), also marked in the data bitmap.
The inode is denoted I[v1], as it is the first version of this inode; it will soon be updated (due to the workload described above).
Let's peek inside this simplified inode too. Inside of I[v1], we see:
owner : remzi permissions : read-write size : 1 pointer : 4 pointer : null pointer : null pointer : null
In this simplified inode, thesizeof the file is 1(it has one block allocated), the first direct pointer points to block4(the first data block of the file, Da), and all three other direct pointers are set tonull(indicating that they are not used). Of course, real inodes have many more fields; see previous chapters for more information.
When we append to the file, we are adding a new data block to it, and thus must update three on-disk structures: the inode (which must point to the new block as well as have a bigger size due to the append), the new data block Db, and a new version of the data bitmap (call it B[v2]) to indicate that the new data block has been allocated.
Thus, in the memory of the system, we have three blocks which we must write to disk. The updated inode (inode version 2, or I[v2] for short) now looks like this:
owner : remzi permissions : read-write size : 2 pointer : 4 pointer : 5 pointer : null pointer : null
The updated data bitmap (B[v2]) now looks like this: 00001100. Finally, there is the data block (Db), which is just filled with whatever it is users put into files. Stolen music perhaps?
What we would like is for the final on-disk image of the file system to look like this:
Inode Data
Inodes Data Blocks
Bmap Bmap
I[v2]
Da Db
To achieve this transition, the file system must perform three separate writes to the disk, one each for the inode (I[v2]), bitmap (B[v2]), and data block (Db). Note that these writes usually don't happen immediately when the user issues awrite()system call; rather, the dirty inode, bitmap, and new data will sit in main memory (in thepage cache orbuffer cache) for some time first; then, when the file system finally decides to write them to disk (after say 5 seconds or 30 seconds), the file system will issue the requisite write requests to the disk. Unfortunately, a crash may occur and thus interfere with these updates to the disk. In particular, if a crash happens after one or two of these writes have taken place, but not all three, the file system could be left in a funny state.
Crash Scenarios
To understand the problem better, let's look at some example crash scenarios. Imagine only a single write succeeds; there are thus three possible outcomes, which we list here:
-
Just the data block (Db) is written to disk. In this case, the data is on disk, but there is no inode that points to it and no bitmap that even says the block is allocated. Thus, it is as if the write never occurred. This case is not a problem at all, from the perspective of file-system crash consistency1.
-
Just the updated inode (I[v2]) is written to disk. In this case, the inode points to the disk address (5) where Db was about to be written, but Db has not yet been written there. Thus, if we trust that pointer, we will readgarbagedata from the disk (the old contents of disk address 5). Further, we have a new problem, which we call afile-system inconsistency. The on-disk bitmap is telling us that data block 5 has not been allocated, but the inode is saying that it has. This disagreement in the file system data structures is an inconsistency in the data structures of the file system; to use the file system, we must somehow resolve this problem (more on that below).
-
Just the updated bitmap (B[v2]) is written to disk.In this case, the bitmap indicates that block 5 is allocated, but there is no inode that points to it. Thus the file system is inconsistent again; if left unresolved, this write would result in aspace leak, as block 5 would never be used by the file system. There are also three more crash scenarios in this attempt to write three blocks to disk. In these cases, two writes succeed and the last one fails:
-
The inode (I[v2]) and bitmap (B[v2]) are written to disk, but not data (Db). In this case, the file system metadata is completely consistent: the inode has a pointer to block 5, the bitmap indicates that 5 is in use, and thus everything looks OK from the perspective of the file system's metadata. But there is one problem: 5 has garbage in it again.
-
The inode (I[v2]) and the data block (Db) are written, but not the bitmap (B[v2]).In this case, we have the inode pointing to the correct data on disk, but again have an inconsistency between the inode and the old version of the bitmap (B1). Thus, we once again need to resolve the problem before using the file system.
-
The bitmap (B[v2]) and data block (Db) are written, but not the inode (I[v2]).In this case, we again have an inconsistency between the inode and the data bitmap. However, even though the block was written and the bitmap indicates its usage, we have no idea which file it belongs to, as no inode points to the file. 1However, it might be a problem for the user, who just lost some data!
Solution 1 The File System Checker
42.2 Solution #1: The File System Checker
The Crash Consistency Problem
Hopefully, from these crash scenarios, you can see the many problems that can occur to our on-disk file system image because of crashes: we can have inconsistency in file system data structures; we can have space leaks; we can return garbage data to a user; and so forth. What we'd like to do ideally is move the file system from one consistent state (e.g., before the file got appended to) to anotheratomically(e.g., after the inode, bitmap, and new data block have been written to disk). Unfortunately, we can't do this easily because the disk only commits one write at a time, and crashes or power loss may occur between these updates. We call this general problem thecrash-consistency problem(we could also call it the consistent-update problem).
Early file systems took a simple approach to crash consistency. Basically, they decided to let inconsistencies happen and then fix them later (when rebooting). A classic example of this lazy approach is found in a tool that does this: fsck2. fsckis a U
NIXtool for finding such inconsistencies and repairing them [M86]; similar tools to check and repair a disk partition exist on different systems. Note that such an approach can't fix all problems; consider, for example, the case above where the file system looks consistent but the inode points to garbage data. The only real goal is to make sure the file system metadata is internally consistent.
The tool fsckoperates in a number of phases, as summarized in
McKusick and Kowalski's paper [MK96]. It is runbefore the file system is mounted and made available (fsckassumes that no other file-system activity is on-going while it runs); once finished, the on-disk file system should be consistent and thus can be made accessible to users.
Here is a basic summary of whatfsckdoes:
-
Superblock: fsckfirst checks if the superblock looks reasonable, mostly doing sanity checks such as making sure the file system size is greater than the number of blocks allocated. Usually the goal of these sanity checks is to find a suspect (corrupt) superblock; in this case, the system (or administrator) may decide to use an alternate copy of the superblock.
-
Free blocks: Next,fsckscans the inodes, indirect blocks, double indirect blocks, etc., to build an understanding of which blocks are currently allocated within the file system. It uses this knowledge to produce a correct version of the allocation bitmaps; thus, if there is any inconsistency between bitmaps and inodes, it is resolved by trusting the information within the inodes. The same type of check is performed for all the inodes, making sure that all inodes that look like they are in use are marked as such in the inode bitmaps. 2Pronounced either "eff-ess-see-kay", "eff-ess-check", or, if you don't like the tool, "effsuck". Yes, serious professional people use this term.
-
Inode state: Each inode is checked for corruption or other prob- lems. For example,fsckmakes sure that each allocated inode has a valid type field (e.g., regular file, directory, symbolic link, etc.). If there are problems with the inode fields that are not easily fixed, the inode is considered suspect and cleared byfsck; the inode bitmap is correspondingly updated.
-
Inode links: fsckalso verifies the link count of each allocated in- ode. As you may recall, the link count indicates the number of different directories that contain a reference (i.e., a link) to this particular file. To verify the link count, fsckscans through the entire directory tree, starting at the root directory, and builds its own link counts for every file and directory in the file system. If there is a mismatch between the newly-calculated count and that found within an inode, corrective action must be taken, usually by fixing the count within the inode. If an allocated inode is discovered but no directory refers to it, it is moved to thelost+founddirectory.
-
Duplicates:fsckalso checks for duplicate pointers, i.e., cases where two different inodes refer to the same block. If one inode is obviously bad, it may be cleared. Alternately, the pointed-to block could be copied, thus giving each inode its own copy as desired.
-
Bad blocks:A check for bad block pointers is also performed while scanning through the list of all pointers. A pointer is considered "bad" if it obviously points to something outside its valid range, e.g., it has an address that refers to a block greater than the partition size. In this case,fsckcan't do anything too intelligent; it just removes (clears) the pointer from the inode or indirect block.
-
Directory checks: fsckdoes not understand the contents of user files; however, directories hold specifically formatted information created by the file system itself. Thus,fsckperforms additional integrity checks on the contents of each directory, making sure that "." and ".." are the first entries, that each inode referred to in a directory entry is allocated, and ensuring that no directory is linked to more than once in the entire hierarchy. As you can see, building a workingfsckrequires intricate knowledge of the file system; making sure such a piece of code works correctly in all cases can be challenging [G+08]. However, fsck (and similar approaches) have a bigger and perhaps more fundamental problem: they aretoo slow.
With a very large disk volume, scanning the entire disk to find all the allocated blocks and read the entire directory tree may take many minutes or hours. Performance offsck, as disks grew in capacity and RAIDs grew in popularity, became prohibitive (despite recent advances [M+13]).
At a higher level, the basic premise of fsckseems just a tad irrational. Consider our example above, where just three blocks are written to the disk; it is incredibly expensive to scan the entire disk to fix problems that occurred during an update of just three blocks. This situation is akin to dropping your keys on the floor in your bedroom, and then commencing asearch-the-entire-house-for-keys recovery algorithm, starting in the basement and working your way through every room. It works but is wasteful. Thus, as disks (and RAIDs) grew, researchers and practitioners started to look for other solutions.
42.3 Solution #2: Journaling (or Write-Ahead Logging)
Probably the most popular solution to the consistent update problem is to steal an idea from the world of database management systems. That idea, known aswrite-ahead logging, was invented to address exactly this type of problem. In file systems, we usually call write-ahead loggingjournalingfor historical reasons. The first file system to do this was Cedar
[H87], though many modern file systems use the idea, including Linux ext3 and ext4, reiserfs, IBM's JFS, SGI's XFS, and Windows NTFS.
The basic idea is as follows. When updating the disk, before overwriting the structures in place, first write down a little note (somewhere else on the disk, in a well-known location) describing what you are about to do. Writing this note is the "write ahead" part, and we write it to a structure that we organize as a "log"; hence, write-ahead logging.
By writing the note to disk, you are guaranteeing that if a crash takes places during the update (overwrite) of the structures you are updating, you can go back and look at the note you made and try again; thus, you will know exactly what to fix (and how to fix it) after a crash, instead of having to scan the entire disk. By design, journaling thus adds a bit of work during updates to greatly reduce the amount of work required during recovery.
We'll now describe howLinux ext3, a popular journaling file system, incorporates journaling into the file system. Most of the on-disk structures are identical toLinux ext2, e.g., the disk is divided into block groups, and each block group has an inode and data bitmap as well as inodes and data blocks. The new key structure is the journal itself, which occupies some small amount of space within the partition or on another device.
Thus, an ext2 file system (without journaling) looks like this:
Super Group 0 Group 1 . . . Group N
Assuming the journal is placed within the same file system image (though sometimes it is placed on a separate device, or as a file within the file system), an ext3 file system with a journal looks like this:
Super Journal Group 0 Group 1 . . . Group N
The real difference is just the presence of the journal, and of course, how it is used.
Solution 3 Other Approaches
42.4 Solution #3: Other Approaches
We've thus far described two options in keeping file system metadata consistent: a lazy approach based onfsck, and a more active approach known as journaling. However, these are not the only two approaches.
One such approach, known as Soft Updates [GP94], was introduced by
Ganger and Patt. This approach carefully orders all writes to the file system to ensure that the on-disk structures are never left in an inconsistent state. For example, by writing a pointed-to data block to diskbefore the inode that points to it, we can ensure that the inode never points to garbage; similar rules can be derived for all the structures of the file system. Implementing Soft Updates can be a challenge, however; whereas the journaling layer described above can be implemented with relatively little knowledge of the exact file system structures, Soft Updates requires intricate knowledge of each file system data structure and thus adds a fair amount of complexity to the system.
Another approach is known ascopy-on-write(yes,COW), and is used
in a number of popular file systems, including Sun's ZFS [B07]. This technique never overwrites files or directories in place; rather, it places new updates to previously unused locations on disk. After a number of updates are completed, COW file systems flip the root structure of the file system to include pointers to the newly updated structures. Doing so makes keeping the file system consistent straightforward. We'll be learning more about this technique when we discuss the log-structured file system (LFS) in a future chapter; LFS is an early example of a COW.
Another approach is one we just developed here at Wisconsin. In this technique, entitledbackpointer-based consistency(orBBC), no ordering is enforced between writes. To achieve consistency, an additionalback pointer is added to every block in the system; for example, each data block has a reference to the inode to which it belongs. When accessing a file, the file system can determine if the file is consistent by checking if the forward pointer (e.g., the address in the inode or direct block) points to a block that refers back to it. If so, everything must have safely reached disk and thus the file is consistent; if not, the file is inconsistent, and an error is returned. By adding back pointers to the file system, a new form of lazy crash consistency can be attained [C+12].
Finally, we also have explored techniques to reduce the number of times a journal protocol has to wait for disk writes to complete. Entitled optimistic crash consistency[C+13], this new approach issues as many writes to disk as possible and uses a generalized form of thetransaction checksum[P+05], as well as a few other techniques, to detect inconsistencies should they arise. For some workloads, these optimistic techniques can improve performance by an order of magnitude. However, to truly function well, a slightly different disk interface is required [C+13].
42.5 Summary
We have introduced the problem of crash consistency, and discussed various approaches to attacking this problem. The older approach of building a file system checker works but is likely too slow to recover on modern systems. Thus, many file systems now use journaling. Journaling reduces recovery time from O(size-of-the-disk-volume) to O(size-of-thelog), thus speeding recovery substantially after a crash and restart. For this reason, many modern file systems use journaling. We have also seen that journaling can come in many different forms; the most commonly used is ordered metadata journaling, which reduces the amount of traffic to the journal while still preserving reasonable consistency guarantees for both file system metadata as well as user data.