Chapter 41: The Problem Poor Performance

This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.

Learning objectives

Explain the main ideas and vocabulary in The Problem Poor Performance.
Work through the source examples for The Problem Poor Performance without depending on raw chunk order.
Use The Problem Poor Performance as selective reference when learner modules point back to Ostep.

Prerequisites

Earlier prerequisite concepts leading into Chapter 41: The Problem Poor Performance.

Module targets

module-04-file-systems-io

AI companion modes

Explain simply
Socratic tutor
Quiz me
Challenge my understanding
Diagnose my confusion
Generate extra practice
Revision mode
Connect forward / backward

Source-of-truth note

This unit is anchored to Ostep and the source chapter "Chapter 41: The Problem Poor Performance". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.

External enrichment

No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.

Source provenance

Primary source: Ostep
Source chapter 41: Chapter 41: The Problem Poor Performance
Raw source file: 203-41-1-the-problem-poor-performance.md
Raw source file: 204-41-3-organizing-structure-the-cylinder-group.md
Raw source file: 205-41-4-policies-how-to-allocate-files-and-directories.md
Raw source file: 206-41-5-measuring-file-locality.md
Raw source file: 207-41-6-the-large-file-exception.md

Merged source

The Problem Poor Performance

41.1 The Problem: Poor Performance

41 Locality and The Fast File System

When the UNIXoperating system was first introduced, the UNIXwizard himself Ken Thompson wrote the first file system. Let's call that the "old

UNIXfile system", and it was really simple. Basically, its data structures looked like this on the disk:

S Inodes Data

The super block (S) contained information about the entire file system:

how big the volume is, how many inodes there are, a pointer to the head of a free list of blocks, and so forth. The inode region of the disk contained all the inodes for the file system. Finally, most of the disk was taken up by data blocks.

The good thing about the old file system was that it was simple, and supported the basic abstractions the file system was trying to deliver: files and the directory hierarchy. This easy-to-use system was a real step forward from the clumsy, record-based storage systems of the past, and the directory hierarchy was a true advance over simpler, one-level hierarchies provided by earlier systems.

The problem: performance was terrible. As measured by Kirk McKusick and his colleagues at Berkeley [MJLF84], performance started off bad and got worse over time, to the point where the file system was delivering only 2% of overall disk bandwidth!

The main issue was that the old UNIXfile system treated the disk like it was a random-access memory; data was spread all over the place without regard to the fact that the medium holding the data was a disk, and thus had real and expensive positioning costs. For example, the data blocks of a file were often very far away from its inode, thus inducing an expensive seek whenever one first read the inode and then the data blocks of a file (a pretty common operation).

Worse, the file system would end up getting quitefragmented, as the free space was not carefully managed. The free list would end up pointing to a bunch of blocks spread across the disk, and as files got allocated, they would simply take the next free block. The result was that a logically contiguous file would be accessed by going back and forth across the disk, thus reducing performance dramatically.

For example, imagine the following data block region, which contains four files (A, B, C, and D), each of size 2 blocks:

A1 A2 B1 B2 C1 C2 D1 D2

If B and D are deleted, the resulting layout is:

A1 A2 C1 C2

As you can see, the free space is fragmented into two chunks of two blocks, instead of one nice contiguous chunk of four. Let's say you now wish to allocate a file E, of size four blocks:

A1 A2 E1 E2 C1 C2 E3 E4

You can see what happens: E gets spread across the disk, and as a result, when accessing E, you don't get peak (sequential) performance from the disk. Rather, you first read E1 and E2, then seek, then read E3 and E4. This fragmentation problem happened all the time in the old

UNIXfile system, and it hurt performance. (A side note: this problem is exactly what disk defragmentation tools help with; they will reorganize on-disk data to place files contiguously and make free space one or a few contiguous regions, moving data around and then rewriting inodes and such to reflect the changes)

One other problem: the original block size was too small (512 bytes).

Thus, transferring data from the disk was inherently inefficient. Smaller blocks were good because they minimizedinternal fragmentation(waste within the block), but bad for transfer as each block might require a positioning overhead to reach it. Thus, the problem:

THECRUX:

HOWTOORGANIZEON-DISKDATATOIMPROVEPERFORMANCE

How can we organize file system data structures so as to improve performance? What types of allocation policies do we need on top of those data structures? How do we make the file system "disk aware"?

41.2 FFS: Disk Awareness Is The Solution

A group at Berkeley decided to build a better, faster file system, which they cleverly called theFast File System (FFS). The idea was to design the file system structures and allocation policies to be "disk aware" and thus improve performance, which is exactly what they did. FFS thus ushered in a new era of file system research; by keeping the sameinterface

to the file system (the same APIs, includingopen(), read(), write(),
close(), and other file system calls) but changing the internalimplemen-

tation, the authors paved the path for new file system construction, work that continues today. Virtually all modern file systems adhere to the existing interface (and thus preserve compatibility with applications) while changing their internals for performance, reliability, or other reasons.

Organizing Structure The Cylinder Group

41.3 Organizing Structure: The Cylinder Group

The first step was to change the on-disk structures. FFS divides the disk into a number ofcylinder groups. A singlecylinderis a set of tracks on different surfaces of a hard drive that are the same distance from the center of the drive; it is called a cylinder because of its clear resemblance to the so-called geometrical shape. FFS aggregates eachNconsecutive cylinders into group, and thus the entire disk can thus be viewed as a collection of cylinder groups. Here is a simple example, showing the four outer most tracks of a drive with six platters, and a cylinder group that consists of three cylinders:

Single track (e.g., dark gray)

Cylinder:

Cylinder Group:

(all tracks with same color)

Set of N consecutive cylinders of drive across different surfaces

Tracks at same distance from center

(if N=3, first group does not include black track)

Note that modern drives do not export enough information for the

file system to truly understand whether a particular cylinder is in use; as discussed previously [AD14], disks export a logical address space of blocks and hide details of their geometry from clients. Thus, modern file systems (such as Linux ext2, ext3, and ext4) instead organize the drive intoblock groups, each of which is just a consecutive portion of the disk's address space. The picture below illustrates an example where every 8 blocks are organized into a different block group (note that real groups would consist of many more blocks):

Group 0 Group 1 Group 2

Whether you call them cylinder groups or block groups, these groups are the central mechanism that FFS uses to improve performance. Critically, by placing two files within the same group, FFS can ensure that accessing one after the other will not result in long seeks across the disk.

To use these groups to store files and directories, FFS needs to have the ability to place files and directories into a group, and track all necessary information about them therein. To do so, FFS includes all the structures you might expect a file system to have within each group, e.g., space for inodes, data blocks, and some structures to track whether each of those are allocated or free. Here is a depiction of what FFS keeps within a single cylinder group:

S ib db Inodes Data

Let's now examine the components of this single cylinder group in more detail. FFS keeps a copy of thesuper block(S) in each group for reliability reasons. The super block is needed to mount the file system;

by keeping multiple copies, if one copy becomes corrupt, you can still mount and access the file system by using a working replica.

Within each group, FFS needs to track whether the inodes and data blocks of the group are allocated. A per-groupinode bitmap(ib) and data bitmap(db) serve this role for inodes and data blocks in each group.

Bitmaps are an excellent way to manage free space in a file system because it is easy to find a large chunk of free space and allocate it to a file, perhaps avoiding some of the fragmentation problems of the free list in the old file system.

Finally, theinodeanddata blockregions are just like those in the previous very-simple file system (VSFS). Most of each cylinder group, as usual, is comprised of data blocks.

Policies How To Allocate Files And Directories

41.4 Policies: How To Allocate Files and Directories

ASIDE: FFS FILECREATION

As an example, think about what data structures must be updated when a file is created; assume, for this example, that the user creates a new file

/foo/bar.txtand that the file is one block long (4KB). The file is new, and thus needs a new inode; thus, both the inode bitmap and the newlyallocated inode will be written to disk. The file also has data in it and thus it too must be allocated; the data bitmap and a data block will thus (eventually) be written to disk. Hence, at least four writes to the current cylinder group will take place (recall that these writes may be buffered in memory for a while before they take place). But this is not all! In particular, when creating a new file, you must also place the file in the file-system hierarchy, i.e., the directory must be updated. Specifically, the parent directoryfoomust be updated to add the entry forbar.txt; this update may fit in an existing data block offooor require a new block to be allocated (with associated data bitmap). The inode offoomust also be updated, both to reflect the new length of the directory as well as to update time fields (such as last-modified-time). Overall, it is a lot of work just to create a new file! Perhaps next time you do so, you should be more thankful, or at least surprised that it all works so well.

With this group structure in place, FFS now has to decide how to place files and directories and associated metadata on disk to improve performance. The basic mantra is simple:keep related stuff together(and its corollary,keep unrelated stuff far apart).

Thus, to obey the mantra, FFS has to decide what is "related" and place it within the same block group; conversely, unrelated items should be placed into different block groups. To achieve this end, FFS makes use of a few simple placement heuristics.

The first is the placement of directories. FFS employs a simple approach: find the cylinder group with a low number of allocated directories (to balance directories across groups) and a high number of free inodes (to subsequently be able to allocate a bunch of files), and put the directory data and inode in that group. Of course, other heuristics could be used here (e.g., taking into account the number of free data blocks).

For files, FFS does two things. First, it makes sure (in the general case) to allocate the data blocks of a file in the same group as its inode, thus preventing long seeks between inode and data (as in the old file system). Second, it places all files that are in the same directory in the cylinder group of the directory they are in. Thus, if a user creates four files,

/dir1/1.txt,/dir1/2.txt,/dir1/3.txt, and/dir99/4.txt, FFS would try to place the first three near one another (same group) and the fourth far away (in some other group).

FFS Locality 100%

Trace

Random 80% 60% 40%

Cumulative Frequency 20% 0% 0 2 4 6 8 10

Path Difference

Figure 41.1:FFS Locality For SEER Traces

These heuristics are not based on extensive studies of file-system traffic or anything particularly nuanced; rather, they are based on good oldfashioned common sense (isn't that what CS stands for after all?). Files in a directoryareoften accessed together (imagine compiling a bunch of files and then linking them into a single executable). Because they are, FFS will often improve performance, making sure that seeks between related files are short and thus improving performance.

Measuring File Locality

41.5 Measuring File Locality

To understand better whether these heuristics make sense, let's analyze some traces of file system access and see if indeed there is namespace locality. For some reason, there doesn't seem to be a good study of this topic in the literature.

Specifically, we'll use the SEER traces [K94] and analyze how "far away" file accesses were from one another in the directory tree. For example, if file fis opened, and then re-opened next in the trace (before any other files are opened), the distance between these two opens in the directory tree is zero (as they are the same file). If a file fin directory dir(i.e.,dir/f) is opened, and followed by an open of filegin the same directory (i.e.,dir/g), the distance between the two file accesses is one, as they share the same directory but are not the same file. Our distance metric, in other words, measures how far up the directory tree you have to travel to find thecommon ancestorof two files; the closer they are in the tree, the lower the metric.

Figure 41.1 shows the locality observed in the SEER traces over all workstations in the SEER cluster over the entirety of all traces. The graph plots the difference metric along the x-axis, and shows the cumulative percentage of file opens that were of that difference along the y-axis.

Specifically, for the SEER traces (marked "Trace" in the graph), you can see that about 7% of file accesses were to the file that was opened previously, and that nearly 40% of file accesses were to either the same file or to one in the same directory (i.e., a difference of zero or one). Thus, the

FFS locality assumption seems to make sense (at least for these traces).

Interestingly, another 25% or so of file accesses were to files that had a distance of two. This type of locality occurs when the user has structured a set of related directories in a multi-level fashion and consistently jumps between them. For example, if a user has a srcdirectory and builds object files (.ofiles) into anobjdirectory, and both of these directories are sub-directories of a main projdirectory, a common access pattern will beproj/src/foo.cfollowed byproj/obj/foo.o. The distance between these two accesses is two, asprojis the common ancestor. FFS doesnotcapture this type of locality in its policies, and thus more seeking will occur between such accesses.

For comparison, the graph also shows locality for a "Random" trace.

The random trace was generated by selecting files from within an existing

SEER trace in random order, and calculating the distance metric between these randomly-ordered accesses. As you can see, there is less namespace locality in the random traces, as expected. However, because eventually every file shares a common ancestor (e.g., the root), there is some locality, and thus random is useful as a comparison point.

The Large File Exception

41.6 The Large-File Exception

In FFS, there is one important exception to the general policy of file placement, and it arises for large files. Without a different rule, a large file would entirely fill the block group it is first placed within (and maybe others). Filling a block group in this manner is undesirable, as it prevents subsequent "related" files from being placed within this block group, and thus may hurt file-access locality.

Thus, for large files, FFS does the following. After some number of blocks are allocated into the first block group (e.g., 12 blocks, or the number of direct pointers available within an inode), FFS places the next "large" chunk of the file (e.g., those pointed to by the first indirect block) in another block group (perhaps chosen for its low utilization). Then, the next chunk of the file is placed in yet another different block group, and so on.

Let's look at some pictures to understand this policy better. Without the large-file exception, a single large file would place all of its blocks into one part of the disk. Let's use a small example of a file with 10 blocks to illustrate the behavior visually.

Here is the depiction of FFS without the large-file exception:

G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 01234 56789

With the large-file exception, you might see something more like this, with the file spread across the disk in chunks:

G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 89 01 23 45 67

The astute reader (that's you) will note that spreading blocks of a file across the disk will hurt performance, particularly in the relatively common case of sequential file access (e.g., when a user or application reads chunks 0 through 9 in order). And you are right! But you can address this problem by choosing chunk size carefully.

Specifically, if the chunk size is large enough, the file system will spend most of its time transferring data from disk and just a (relatively) little time seeking between chunks of the block. This process of reducing an overhead by doing more work per overhead paid is calledamortization and is a common technique in computer systems.

Let's do an example: assume that the average positioning time (i.e., seek and rotation) for a disk is 10 ms. Assume further that the disk transfers data at 40 MB/s. If your goal was to spend half our time seeking between chunks and half our time transferring data (and thus achieve 50% of peak disk performance), you would thus need to spend 10 ms transferring data for every 10 ms positioning. So the question becomes:

how big does a chunk have to be in order to spend 10 ms in transfer?

Easy, just use our old friend, math, in particular the dimensional analysis mentioned in the chapter on disks [AD14]:

40✘✘M B 1024KB 1✟sec✟

· · · 10✟ms✟= 409.6KB (41.1)

✟sec✟ 1✘✘M B 1000✟ms✟

Basically, what this equation says is this: if you transfer data at 40

MB/s, you need to transfer only 409.6KB every time you seek in order to spend half your time seeking and half your time transferring. Similarly, you can compute the size of the chunk you would need to achieve 90% of peak bandwidth (turns out it is about 3.69MB), or even 99% of peak bandwidth (40.6MB!). As you can see, the closer you want to get to peak, the bigger these chunks get (see Figure 41.2 for a plot of these values).

FFS did not use this type of calculation in order to spread large files across groups, however. Instead, it took a simple approach, based on the structure of the inode itself. The first twelve direct blocks were placed in the same group as the inode; each subsequent indirect block, and all the blocks it pointed to, was placed in a different group. With a block size of 4KB, and 32-bit disk addresses, this strategy implies that every 1024 blocks of the file (4MB) were placed in separate groups, the lone exception being the first 48KB of the file as pointed to by direct pointers.

The Challenges of Amortization

10M 90%, 3.69M 1M 50%, 409.6K 32K

Log(Chunk Size Needed) 1K 0% 25% 50% 75% 100%

Percent Bandwidth (Desired)

Figure 41.2:Amortization: How Big Do Chunks Have To Be?

Note that the trend in disk drives is that transfer rate improves fairly

rapidly, as disk manufacturers are good at cramming more bits into the same surface, but the mechanical aspects of drives related to seeks (disk arm speed and the rate of rotation) improve rather slowly [P98]. The implication is that over time, mechanical costs become relatively more expensive, and thus, to amortize said costs, you have to transfer more data between seeks.

41.7 A Few Other Things About FFS

FFS introduced a few other innovations too. In particular, the designers were extremely worried about accommodating small files; as it turned out, many files were 2KB or so in size back then, and using 4KB blocks, while good for transferring data, was not so good for space efficiency.

Thisinternal fragmentationcould thus lead to roughly half the disk being wasted for a typical file system.

The solution the FFS designers hit upon was simple and solved the problem. They decided to introducesub-blocks, which were 512-byte little blocks that the file system could allocate to files. Thus, if you created a small file (say 1KB in size), it would occupy two sub-blocks and thus not waste an entire 4KB block. As the file grew, the file system will continue allocating 512-byte blocks to it until it acquires a full 4KB of data. At that point, FFS will find a 4KB block,copythe sub-blocks into it, and free the sub-blocks for future use.

9 8 10 4 7 11 9

Spindle Spindle 6 0 3 0 5 1 8 6 4 2 2 1 3 7

Learning objectives​

Prerequisites​

Module targets​

AI companion modes​

Source-of-truth note​

External enrichment​

Source provenance​

Merged source​

The Problem Poor Performance​

41.1 The Problem: Poor Performance

41 Locality and The Fast File System​

41.2 FFS: Disk Awareness Is The Solution​

Organizing Structure The Cylinder Group​