Chapter 19: TLB Basic Algorithm

This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.

Learning objectives

Explain the main ideas and vocabulary in TLB Basic Algorithm.
Work through the source examples for TLB Basic Algorithm without depending on raw chunk order.
Use TLB Basic Algorithm as selective reference when learner modules point back to Ostep.

Prerequisites

Earlier prerequisite concepts leading into Chapter 19: TLB Basic Algorithm.

Module targets

module-01-processes-scheduling
module-02-memory-management-virtual-memory

AI companion modes

Explain simply
Socratic tutor
Quiz me
Challenge my understanding
Diagnose my confusion
Generate extra practice
Revision mode
Connect forward / backward

Source-of-truth note

This unit is anchored to Ostep and the source chapter "Chapter 19: TLB Basic Algorithm". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.

External enrichment

No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.

Source provenance

Primary source: Ostep
Source chapter 19: Chapter 19: TLB Basic Algorithm
Raw source file: 087-19-1-tlb-basic-algorithm.md
Raw source file: 088-19-2-example-accessing-an-array.md
Raw source file: 090-19-5-tlb-issue-context-switches.md
Raw source file: 091-19-7-a-real-tlb-entry.md

Merged source

TLB Basic Algorithm

19.1 TLB Basic Algorithm

Figure 19.1 shows a rough sketch of how hardware might handle a virtual address translation, assuming a simplelinear page table(i.e., the page table is an array) and ahardware-managed TLB(i.e., the hardware handles much of the responsibility of page table accesses; we'll explain more about this below).

if (Success == True) // TLB Hit
if (CanAccess(TlbEntry.ProtectBits) == True)
Offset = VirtualAddress & OFFSET_MASK
PhysAddr = (TlbEntry.PFN << SHIFT) | Offset
AccessMemory(PhysAddr)
else
RaiseException(PROTECTION_FAULT)
else // TLB Miss
PTEAddr = PTBR + (VPN * sizeof(PTE))
PTE = AccessMemory(PTEAddr)
if (PTE.Valid == False)
RaiseException(SEGMENTATION_FAULT)
else if (CanAccess(PTE.ProtectBits) == False)
RaiseException(PROTECTION_FAULT)
else
TLB_Insert(VPN, PTE.PFN, PTE.ProtectBits)
RetryInstruction()

Figure 19.1:TLB Control Flow Algorithm

The algorithm the hardware follows works like this: first, extract the virtual page number (VPN) from the virtual address (Line 1 in Figure 19.1), and check if the TLB holds the translation for this VPN (Line 2). If it does, we have aTLB hit, which means the TLB holds the translation. Success!

We can now extract the page frame number (PFN) from the relevant TLB entry, concatenate that onto the offset from the original virtual address, and form the desired physical address (PA), and access memory (Lines 5-7), assuming protection checks do not fail (Line 4).

If the CPU does not find the translation in the TLB (aTLB miss), we have some more work to do. In this example, the hardware accesses the page table to find the translation (Lines 11-12), and, assuming that the virtual memory reference generated by the process is valid and accessible (Lines 13, 15), updates the TLB with the translation (Line 18). These set of actions are costly, primarily because of the extra memory reference needed to access the page table (Line 12). Finally, once the TLB is updated, the hardware retries the instruction; this time, the translation is found in the TLB, and the memory reference is processed quickly.

The TLB, like all caches, is built on the premise that in the common case, translations are found in the cache (i.e., are hits). If so, little overhead is added, as the TLB is found near the processing core and is designed to be quite fast. When a miss occurs, the high cost of paging is incurred; the page table must be accessed to find the translation, and an extra memory reference (or more, with more complex page tables) results.

If this happens often, the program will likely run noticeably more slowly; memory accesses, relative to most CPU instructions, are quite costly, and

TLB misses lead to more memory accesses. Thus, it is our hope to avoid

TLB misses as much as we can.

Offset 00 04 08 12

VPN = 00
VPN = 01
VPN = 02
VPN = 03
VPN = 04
VPN = 05
VPN =
a[0] a[1] a[2]
VPN =
a[3] a[4] a[5] a[6]
VPN =
a[7] a[8] a[9]
VPN = 09
VPN = 10
VPN = 11
VPN = 12
VPN = 13
VPN = 14
VPN = 15

Figure 19.2:Example: An Array In A Tiny Address Space

Example Accessing An Array

19.2 Example: Accessing An Array

To make clear the operation of a TLB, let's examine a simple virtual address trace and see how a TLB can improve its performance. In this

example, let's assume we have an array of 10 4-byte integers in memory, starting at virtual address 100. Assume further that we have a small 8-bit virtual address space, with 16-byte pages; thus, a virtual address breaks down into a 4-bit VPN (there are 16 virtual pages) and a 4-bit offset (there are 16 bytes on each of those pages).

Figure 19.2 shows the array laid out on the 16 16-byte pages of the system. As you can see, the array's first entry (a[0]) begins on (VPN=06, offset=04); only three 4-byte integers fit onto that page. The array continues

onto the next page (VPN=07), where the next four entries (a[3]... a[6])

are found. Finally, the last three entries of the 10-entry array (a[7]...a[9])

are located on the next page of the address space (VPN=08).

Now let's consider a simple loop that accesses each array element, something that would look like this in C:

int sum = 0;
for (i = 0; i < 10; i++) {
sum += a[i];
}
For the sake of simplicity, we will pretend that the only memory accesses the loop generates are to the array (ignoring the variablesiand sum, as well as the instructions themselves). When the first array element (a[0]) is accessed, the CPU will see a load to virtual address 100. The
hardware extracts the VPN from this (VPN=06), and uses that to check

the TLB for a valid translation. Assuming this is the first time the program accesses the array, the result will be a TLB miss.

The next access is toa[1], and there is some good news here: a TLB hit! Because the second element of the array is packed next to the first, it lives on the same page; because we've already accessed this page when accessing the first element of the array, the translation is already loaded into the TLB. And hence the reason for our success. Access toa[2]encounters similar success (another hit), because it too lives on the same page asa[0]anda[1].

Unfortunately, when the program accesses a[3], we encounter another TLB miss. However, once again, the next entries (a[4]... a[6]) will hit in the TLB, as they all reside on the same page in memory.

Finally, access toa[7]causes one last TLB miss. The hardware once again consults the page table to figure out the location of this virtual page in physical memory, and updates the TLB accordingly. The final two accesses (a[8]anda[9]) receive the benefits of this TLB update; when the hardware looks in the TLB for their translations, two more hits result.

Let us summarize TLB activity during our ten accesses to the array:

miss, hit, hit, miss, hit, hit, hit, miss, hit, hit. Thus, our TLBhit rate, which is the number of hits divided by the total number of accesses, is 70%. Although this is not too high (indeed, we desire hit rates that approach 100%), it is non-zero, which may be a surprise. Even though this is the first time the program accesses the array, the TLB improves performance due tospatial locality. The elements of the array are packed tightly into pages (i.e., they are close to one another inspace), and thus only the first access to an element on a page yields a TLB miss.

Also note the role that page size plays in this example. If the page size had simply been twice as big (32 bytes, not 16), the array access would suffer even fewer misses. As typical page sizes are more like 4KB, these types of dense, array-based accesses achieve excellent TLB performance, encountering only a single miss per page of accesses.

One last point about TLB performance: if the program, soon after this loop completes, accesses the array again, we'd likely see an even better result, assuming that we have a big enough TLB to cache the needed translations: hit, hit, hit, hit, hit, hit, hit, hit, hit, hit. In this case, the

TLB hit rate would be high because oftemporal locality, i.e., the quick re-referencing of memory items intime. Like any cache, TLBs rely upon both spatial and temporal locality for success, which are program properties. If the program of interest exhibits such locality (and many programs do), the TLB hit rate will likely be high.

TIP: USECACHINGWHENPOSSIBLE

Caching is one of the most fundamental performance techniques in computer systems, one that is used again and again to make the "commoncase fast" [HP06]. The idea behind hardware caches is to take advantage oflocalityin instruction and data references. There are usually two types of locality:temporal localityandspatial locality. With temporal locality, the idea is that an instruction or data item that has been recently accessed will likely be re-accessed soon in the future. Think of loop variables or instructions in a loop; they are accessed repeatedly over time. With spatial locality, the idea is that if a program accesses memory at addressx, it will likely soon access memory nearx. Imagine here streaming through an array of some kind, accessing one element and then the next. Of course, these properties depend on the exact nature of the program, and thus are not hard-and-fast laws but more like rules of thumb.

Hardware caches, whether for instructions, data, or address translations (as in our TLB) take advantage of locality by keeping copies of memory in small, fast on-chip memory. Instead of having to go to a (slow) memory to satisfy a request, the processor can first check if a nearby copy exists in a cache; if it does, the processor can access it quickly (i.e., in a few

CPU cycles) and avoid spending the costly time it takes to access memory (many nanoseconds).

You might be wondering: if caches (like the TLB) are so great, why don't we just make bigger caches and keep all of our data in them? Unfortunately, this is where we run into more fundamental laws like those of physics. If you want a fast cache, it has to be small, as issues like the speed-of-light and other physical constraints become relevant. Any large cache by definition is slow, and thus defeats the purpose. Thus, we are stuck with small, fast caches; the question that remains is how to best use them to improve performance.

19.3 Who Handles The TLB Miss?

One question that we must answer: who handles a TLB miss? Two answers are possible: the hardware, or the software (OS). In the olden days, the hardware had complex instruction sets (sometimes calledCISC, for complex-instruction set computers) and the people who built the hardware didn't much trust those sneaky OS people. Thus, the hardware would handle the TLB miss entirely. To do this, the hardware has to know exactlywherethe page tables are located in memory (via apagetable base register, used in Line 11 in Figure 19.1), as well as theirexact format; on a miss, the hardware would "walk" the page table, find the correct page-table entry and extract the desired translation, update the TLB with the translation, and retry the instruction. An example of an "older" architecture that hashardware-managed TLBsis the Intel x86 architecture, which uses a fixedmulti-level page table(see the next chapter for details); the current page table is pointed to by the CR3 register [I09].

if (Success == True) // TLB Hit
if (CanAccess(TlbEntry.ProtectBits) == True)
Offset = VirtualAddress & OFFSET_MASK
PhysAddr = (TlbEntry.PFN << SHIFT) | Offset
Register = AccessMemory(PhysAddr)
else
RaiseException(PROTECTION_FAULT)
else // TLB Miss
RaiseException(TLB_MISS)

Figure 19.3:TLB Control Flow Algorithm (OS Handled)

More modern architectures (e.g., MIPS R10k [H93] or Sun's SPARC v9

[WG00], bothRISCor reduced-instruction set computers) have what is known as asoftware-managed TLB. On a TLB miss, the hardware simply raises an exception (line 11 in Figure 19.3), which pauses the current instruction stream, raises the privilege level to kernel mode, and jumps to atrap handler. As you might guess, this trap handler is code within the OS that is written with the express purpose of handling TLB misses.

When run, the code will lookup the translation in the page table, use special "privileged" instructions to update the TLB, and return from the trap; at this point, the hardware retries the instruction (resulting in a TLB hit).

Let's discuss a couple of important details. First, the return-from-trap instruction needs to be a little different than the return-from-trap we saw before when servicing a system call. In the latter case, the return-fromtrap should resume execution at the instructionafterthe trap into the OS, just as a return from a procedure call returns to the instruction immediately following the call into the procedure. In the former case, when returning from a TLB miss-handling trap, the hardware must resume execution at the instruction thatcausedthe trap; this retry thus lets the instruction run again, this time resulting in a TLB hit. Thus, depending on how a trap or exception was caused, the hardware must save a different

PC when trapping into the OS, in order to resume properly when the time to do so arrives.

Second, when running the TLB miss-handling code, the OS needs to be extra careful not to cause an infinite chain of TLB misses to occur. Many solutions exist; for example, you could keep TLB miss handlers in physical memory (where they areunmappedand not subject to address translation), or reserve some entries in the TLB for permanently-valid translations and use some of those permanent translation slots for the handler code itself; thesewiredtranslations always hit in the TLB.

The primary advantage of the software-managed approach isflexibility: the OS can use any data structure it wants to implement the page table, without necessitating hardware change. Another advantage issimplicity; as you can see in the TLB control flow (line 11 in Figure 19.3, in contrast to lines 11-19 in Figure 19.1), the hardware doesn't have to do much on a miss; it raises an exception, and the OS TLB miss handler does the rest.

TLB Issue Context Switches

19.5 TLB Issue: Context Switches

With TLBs, some new issues arise when switching between processes (and hence address spaces). Specifically, the TLB contains virtual-to-physical translations that are only valid for the currently running process; these translations are not meaningful for other processes. As a result, when switching from one process to another, the hardware or OS (or both) must be careful to ensure that the about-to-be-run process does not accidentally use translations from some previously run process.

To understand this situation better, let's look at an example. When one process (P1) is running, it assumes the TLB might be caching translations that are valid for it, i.e., that come from P1's page table. Assume, for this

example, that the 10th virtual page of P1 is mapped to physical frame 100.

In this example, assume another process (P2) exists, and the OS soon might decide to perform a context switch and run it. Assume here that the 10th virtual page of P2 is mapped to physical frame 170. If entries for both processes were in the TLB, the contents of the TLB would be:

VPN PFN valid prot 10 100 1 rwx

-- -- 0 -- 10 170 1 rwx

-- -- 0 --

In the TLB above, we clearly have a problem: VPN 10 translates to either PFN 100 (P1) or PFN 170 (P2), but the hardware can't distinguish which entry is meant for which process. Thus, we need to do some more work in order for the TLB to correctly and efficiently support virtualization across multiple processes. And thus, a crux:

THECRUX:

HOWTOMANAGETLB CONTENTSONA CONTEXTSWITCH

When context-switching between processes, the translations in the TLB for the last process are not meaningful to the about-to-be-run process.

What should the hardware or OS do in order to solve this problem?

There are a number of possible solutions to this problem. One approach is to simply flush the TLB on context switches, thus emptying it before running the next process. On a software-based system, this can be accomplished with an explicit (and privileged) hardware instruction; with a hardware-managed TLB, the flush could be enacted when the page-table base register is changed (note the OS must change the PTBR on a context switch anyhow). In either case, the flush operation simply sets all valid bits to 0, essentially clearing the contents of the TLB.

By flushing the TLB on each context switch, we now have a working

solution, as a process will never accidentally encounter the wrong translations in the TLB. However, there is a cost: each time a process runs, it must incur TLB misses as it touches its data and code pages. If the OS switches between processes frequently, this cost may be high.

To reduce this overhead, some systems add hardware support to enable sharing of the TLB across context switches. In particular, some hardware systems provide an address space identifier (ASID) field in the

TLB. You can think of the ASID as aprocess identifier(PID), but usually it has fewer bits (e.g., 8 bits for the ASID versus 32 bits for a PID).

If we take our example TLB from above and add ASIDs, it is clear processes can readily share the TLB: only the ASID field is needed to differentiate otherwise identical translations. Here is a depiction of a TLB with the added ASID field:

VPN PFN valid prot ASID 10 100 1 rwx 1

-- -- 0 -- -- 10 170 1 rwx 2

-- -- 0 -- --

Thus, with address-space identifiers, the TLB can hold translations from different processes at the same time without any confusion. Of course, the hardware also needs to know which process is currently running in order to perform translations, and thus the OS must, on a context switch, set some privileged register to the ASID of the current process.

As an aside, you may also have thought of another case where two entries of the TLB are remarkably similar. In this example, there are two entries for two different processes with two different VPNs that point to thesamephysical page:

VPN PFN valid prot ASID 10 101 1 r-x 1

-- -- 0 -- -- 50 101 1 r-x 2

-- -- 0 -- --

This situation might arise, for example, when two processesshare a page (a code page, for example). In the example above, Process 1 is sharing physical page 101 with Process 2; P1 maps this page into the 10th page of its address space, whereas P2 maps it to the 50th page of its address space. Sharing of code pages (in binaries, or shared libraries) is useful as it reduces the number of physical pages in use, thus reducing memory overheads.

19.6 Issue: Replacement Policy

As with any cache, and thus also with the TLB, one more issue that we must consider iscache replacement. Specifically, when we are installing a new entry in the TLB, we have to replace an old one, and thus the question: which one to replace?

THECRUX: HOWTODESIGNTLB REPLACEMENTPOLICY

Which TLB entry should be replaced when we add a new TLB entry?

The goal, of course, being to minimize themiss rate(or increasehit rate) and thus improve performance.

We will study such policies in some detail when we tackle the problem of swapping pages to disk; here we'll just highlight a few typical policies.

One common approach is to evict theleast-recently-usedorLRUentry.

LRU tries to take advantage of locality in the memory-reference stream, assuming it is likely that an entry that has not recently been used is a good candidate for eviction. Another typical approach is to use arandompolicy, which evicts a TLB mapping at random. Such a policy is useful due to its simplicity and ability to avoid corner-case behaviors; for example, a "reasonable" policy such as LRU behaves quite unreasonably when a program loops overn+ 1pages with a TLB of sizen; in this case, LRU misses upon every access, whereas random does much better.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

VPN G ASID

PFN C D V

A Real TLB Entry

19.7 A Real TLB Entry

Figure 19.4:A MIPS TLB Entry

Finally, let's briefly look at a real TLB. This example is from the MIPS

R4000 [H93], a modern system that uses software-managed TLBs; a slightly simplified MIPS TLB entry can be seen in Figure 19.4.

The MIPS R4000 supports a 32-bit address space with 4KB pages. Thus, we would expect a 20-bit VPN and 12-bit offset in our typical virtual address. However, as you can see in the TLB, there are only 19 bits for the

VPN; as it turns out, user addresses will only come from half the address space (the rest reserved for the kernel) and hence only 19 bits of VPN are needed. The VPN translates to up to a 24-bit physical frame number (PFN), and hence can support systems with up to 64GB of (physical) main memory (2244KB pages).

There are a few other interesting bits in the MIPS TLB. We see aglobal bit (G), which is used for pages that are globally-shared among processes.

Thus, if the global bit is set, the ASID is ignored. We also see the 8-bit

ASID, which the OS can use to distinguish between address spaces (as described above). One question for you: what should the OS do if there are more than 256 (28) processes running at a time? Finally, we see 3

Coherence(C) bits, which determine how a page is cached by the hardware (a bit beyond the scope of these notes); adirtybit which is marked when the page has been written to (we'll see the use of this later); a valid bit which tells the hardware if there is a valid translation present in the entry.

There is also apage maskfield (not shown), which supports multiple page sizes; we'll see later why having larger pages might be useful. Finally, some of the 64 bits are unused (shaded gray in the diagram).

MIPS TLBs usually have 32 or 64 of these entries, most of which are used by user processes as they run. However, a few are reserved for the

OS. Awiredregister can be set by the OS to tell the hardware how many slots of the TLB to reserve for the OS; the OS uses these reserved mappings for code and data that it wants to access during critical times, where a TLB miss would be problematic (e.g., in the TLB miss handler).

Because the MIPS TLB is software managed, there needs to be instructions to update the TLB. The MIPS provides four such instructions:TLBP, which probes the TLB to see if a particular translation is in there;TLBR, which reads the contents of a TLB entry into registers;TLBWI, which replaces a specific TLB entry; andTLBWR, which replaces a random TLB entry. The OS uses these instructions to manage the TLB's contents. It is of course critical that these instructions are privileged; imagine what a user process could do if it could modify the contents of the TLB (hint: just about anything, including take over the machine, run its own malicious "OS", or even make the Sun disappear).

TIP: RAM ISN'TALWAYSRAM (CULLER'SLAW)

The termrandom-access memory, or RAM, implies that you can access any part of RAM just as quickly as another. While it is generally good to think of RAM in this way, because of hardware/OS features such as the

TLB, accessing a particular page of memory may be costly, particularly if that page isn't currently mapped by your TLB. Thus, it is always good to remember the implementation tip: RAM isn't always RAM. Sometimes randomly accessing your address space, particular if the number of pages accessed exceeds the TLB coverage, can lead to severe performance penalties. Because one of our advisors, David Culler, used to always point to the TLB as the source of many performance problems, we name this law in his honor:Culler's Law.

19.8 Summary

We have seen how hardware can help us make address translation faster. By providing a small, dedicated on-chip TLB as an address-translation cache, most memory references will hopefully be handledwithouthaving to access the page table in main memory. Thus, in the common case, the performance of the program will be almost as if memory isn't being virtualized at all, an excellent achievement for an operating system, and certainly essential to the use of paging in modern systems.

However, TLBs do not make the world rosy for every program that exists. In particular, if the number of pages a program accesses in a short period of time exceeds the number of pages that fit into the TLB, the program will generate a large number of TLB misses, and thus run quite a bit more slowly. We refer to this phenomenon as exceeding theTLB coverage, and it can be quite a problem for certain programs. One solution, as we'll discuss in the next chapter, is to include support for larger page sizes; by mapping key data structures into regions of the program's address space that are mapped by larger pages, the effective coverage of the

TLB can be increased. Support for large pages is often exploited by programs such as adatabase management system(a DBMS), which have certain data structures that are both large and randomly-accessed.

One other TLB issue worth mentioning: TLB access can easily become a bottleneck in the CPU pipeline, in particular with what is called a physically-indexed cache. With such a cache, address translation has to take placebeforethe cache is accessed, which can slow things down quite a bit. Because of this potential problem, people have looked into all sorts of clever ways to access caches withvirtualaddresses, thus avoiding the expensive step of translation in the case of a cache hit. Such avirtuallyindexed cachesolves some performance problems, but introduces new issues into hardware design as well. See Wiggins's fine survey for more details [W03].

[BC91] "Performance from Architecture: Comparing a RISC and a CISC

with Similar Hardware Organization"

Learning objectives​

Prerequisites​

Module targets​

AI companion modes​

Source-of-truth note​

External enrichment​

Source provenance​

Merged source​

TLB Basic Algorithm​