Aside Optimizing Log Writes
This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.
Learning objectives
- Explain the main ideas and vocabulary in Aside Optimizing Log Writes.
- Work through the source examples for Aside Optimizing Log Writes without depending on raw chunk order.
- Use Aside Optimizing Log Writes as selective reference when learner modules point back to Ostep.
Prerequisites
- None curated yet.
Module targets
module-04-file-systems-io
AI companion modes
- Explain simply
- Socratic tutor
- Quiz me
- Challenge my understanding
- Diagnose my confusion
- Generate extra practice
- Revision mode
- Connect forward / backward
Source-of-truth note
This unit is anchored to Ostep and the source chapter "Aside Optimizing Log Writes". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.
External enrichment
No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.
Source provenance
- Primary source:
Ostep - Source chapter: Aside Optimizing Log Writes
- Raw source file:
213-aside-optimizinglogwrites.md
Merged source
Aside Optimizing Log Writes
ASIDE: OPTIMIZINGLOGWRITES
You may have noticed a particular inefficiency of writing to the log.
Namely, the file system first has to write out the transaction-begin block and contents of the transaction; only after these writes complete can the file system send the transaction-end block to disk. The performance impact is clear, if you think about how a disk works: usually an extra rotation is incurred (think about why).
One of our former graduate students, Vijayan Prabhakaran, had a simple idea to fix this problem [P+05]. When writing a transaction to the journal, include a checksum of the contents of the journal in the begin and end blocks. Doing so enables the file system to write the entire transaction at once, without incurring a wait; if, during recovery, the file system sees a mismatch in the computed checksum versus the stored checksum in the transaction, it can conclude that a crash occurred during the write of the transaction and thus discard the file-system update. Thus, with a small tweak in the write protocol and recovery system, a file system can achieve faster common-case performance; on top of that, the system is slightly more reliable, as any reads from the journal are now protected by a checksum.
This simple fix was attractive enough to gain the notice of Linux file system developers, who then incorporated it into the next generation Linux file system, called (you guessed it!) Linux ext4. It now ships on millions of machines worldwide, including the Android handheld platform.
Thus, every time you write to disk on many Linux-based systems, a little code developed at Wisconsin makes your system a little faster and more reliable.
To avoid this problem, the file system issues the transactional write in two steps. First, it writes all blocks except the TxE block to the journal, issuing these writes all at once. When these writes complete, the journal will look something like this (assuming our append workload again):
TxB I[v2] B[v2] Db
Journalid=1
When those writes complete, the file system issues the write of the TxE block, thus leaving the journal in this final, safe state:
TxB I[v2] B[v2] Db TxE
Journalid=1 id=1
An important aspect of this process is the atomicity guarantee provided by the disk. It turns out that the disk guarantees that any 512-byte write will either happen or not (and never be half-written); thus, to make sure the write of TxE is atomic, one should make it a single 512-byte block.
Thus, our current protocol to update the file system, with each of its three phases labeled:
- Journal write:Write the contents of the transaction (including TxB,
metadata, and data) to the log; wait for these writes to complete.
- Journal commit: Write the transaction commit block (containing
TxE) to the log; wait for write to complete; transaction is said to be committed.
- Checkpoint:Write the contents of the update (metadata and data)
to their final on-disk locations.
Recovery
Let's now understand how a file system can use the contents of the journal torecoverfrom a crash. A crash may happen at any time during this sequence of updates. If the crash happens before the transaction is written safely to the log (i.e., before Step 2 above completes), then our job is easy: the pending update is simply skipped. If the crash happens after the transaction has committed to the log, but before the checkpoint is complete, the file system canrecoverthe update as follows. When the system boots, the file system recovery process will scan the log and look for transactions that have committed to the disk; these transactions are thusreplayed(in order), with the file system again attempting to write out the blocks in the transaction to their final on-disk locations. This form of logging is one of the simplest forms there is, and is calledredo logging.
By recovering the committed transactions in the journal, the file system ensures that the on-disk structures are consistent, and thus can proceed by mounting the file system and readying itself for new requests.
Note that it is fine for a crash to happen at any point during checkpointing, even after some of the updates to the final locations of the blocks have completed. In the worst case, some of these updates are simply performed again during recovery. Because recovery is a rare operation (only taking place after an unexpected system crash), a few redundant writes are nothing to worry about3.
Batching Log Updates
You might have noticed that the basic protocol could add a lot of extra disk traffic. For example, imagine we create two files in a row, called file1andfile2, in the same directory. To create one file, one has to update a number of on-disk structures, minimally including: the inode bitmap (to allocated a new inode), the newly-created inode of the file, the 3Unless you worry about everything, in which case we can't help you. Stop worrying so much, it is unhealthy! But now you're probably worried about over-worrying.
data block of the parent directory containing the new directory entry, as well as the parent directory inode (which now has a new modification time). With journaling, we logically commit all of this information to the journal for each of our two file creations; because the files are in the same directory, and assuming they even have inodes within the same inode block, this means that if we're not careful, we'll end up writing these same blocks over and over.
To remedy this problem, some file systems do not commit each update to disk one at a time (e.g., Linux ext3); rather, one can buffer all updates into a global transaction. In our example above, when the two files are created, the file system just marks the in-memory inode bitmap, inodes of the files, directory data, and directory inode as dirty, and adds them to the list of blocks that form the current transaction. When it is finally time to write these blocks to disk (say, after a timeout of 5 seconds), this single global transaction is committed containing all of the updates described above. Thus, by buffering updates, a file system can avoid excessive write traffic to disk in many cases.
Making The Log Finite
We thus have arrived at a basic protocol for updating file-system on-disk structures. The file system buffers updates in memory for some time; when it is finally time to write to disk, the file system first carefully writes
out the details of the transaction to the journal (a.k.a. write-ahead log);
after the transaction is complete, the file system checkpoints those blocks to their final locations on disk.
However, the log is of a finite size. If we keep adding transactions to it (as in this figure), it will soon fill. What do you think happens then?
Tx1 Tx2 Tx3 Tx4 Tx5 ...
Journal
Two problems arise when the log becomes full. The first is simpler, but less critical: the larger the log, the longer recovery will take, as the recovery process must replay all the transactions within the log (in order) to recover. The second is more of an issue: when the log is full (or nearly full), no further transactions can be committed to the disk, thus making the file system "less than useful" (i.e., useless).
To address these problems, journaling file systems treat the log as a circular data structure, re-using it over and over; this is why the journal is sometimes referred to as acircular log. To do so, the file system must take action some time after a checkpoint. Specifically, once a transaction has been checkpointed, the file system should free the space it was occupying within the journal, allowing the log space to be reused. There are many ways to achieve this end; for example, you could simply mark the oldest and newest non-checkpointed transactions in the log in ajournal superblock; all other space is free. Here is a graphical depiction:
Journal
Tx1 Tx2 Tx3 Tx4 Tx5 ...
Super
Journal
In the journal superblock (not to be confused with the main file system superblock), the journaling system records enough information to know which transactions have not yet been checkpointed, and thus reduces recovery time as well as enables re-use of the log in a circular fashion. And thus we add another step to our basic protocol:
- Journal write:Write the contents of the transaction (containing TxB