Skip to main content

D Bhandarkar And Douglas W Clark

This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.

Learning objectives

  • Explain the main ideas and vocabulary in D Bhandarkar And Douglas W Clark.
  • Work through the source examples for D Bhandarkar And Douglas W Clark without depending on raw chunk order.
  • Use D Bhandarkar And Douglas W Clark as selective reference when learner modules point back to Ostep.

Prerequisites

  • None curated yet.

Module targets

  • module-02-memory-management-virtual-memory

AI companion modes

  • Explain simply
  • Socratic tutor
  • Quiz me
  • Challenge my understanding
  • Diagnose my confusion
  • Generate extra practice
  • Revision mode
  • Connect forward / backward

Source-of-truth note

This unit is anchored to Ostep and the source chapter "D Bhandarkar And Douglas W Clark". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.

External enrichment

No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.

Source provenance

  • Primary source: Ostep
  • Source chapter: D Bhandarkar And Douglas W Clark
  • Raw source file: 092-d-bhandarkar-and-douglas-w-clark.md

Merged source

D Bhandarkar And Douglas W Clark

D. Bhandarkar and Douglas W. Clark

Communications of the ACM, September 1991

A great and fair comparison between RISC and CISC. The bottom line: on similar hardware, RISC was about a factor of three better in performance.

[CM00] "The evolution of RISC technology at IBM"

John Cocke and V. Markstein

IBM Journal of Research and Development, 44:1/2

A summary of the ideas and work behind the IBM 801, which many consider the first true RISC microprocessor.

[C95] "The Core of the Black Canyon Computer Corporation"

John Couleur

IEEE Annals of History of Computing, 17:4, 1995

In this fascinating historical note, Couleur talks about how he invented the TLB in 1964 while working for GE, and the fortuitous collaboration that thus ensued with the Project MAC folks at MIT.

[CG68] "Shared-access Data Processing System"

John F. Couleur and Edward L. Glaser

Patent 3412382, November 1968

The patent that contains the idea for an associative memory to store address translations. The idea, according to Couleur, came in 1964.

[CP78] "The architecture of the IBM System/370"

R.P. Case and A. Padegs

Communications of the ACM. 21:1, 73-96, January 1978

Perhaps the first paper to use the termtranslation lookaside buffer. The name arises from the historical name for a cache, which was alookaside bufferas called by those developing the Atlas system at the University of Manchester; a cache of address translations thus became atranslation lookaside buffer. Even though the term lookaside buffer fell out of favor, TLB seems to have stuck, for whatever reason.

[H93] "MIPS R4000 Microprocessor User's Manual".

Joe Heinrich, Prentice-Hall, June 1993

Available: http://cag.csail.mit.edu/raw/ documents/R4400UmanbookEd2.pdf

[HP06] "Computer Architecture: A Quantitative Approach"

John Hennessy and David Patterson

Morgan-Kaufmann, 2006

A great book about computer architecture. We have a particular attachment to the classic first edition.

[I09] "Intel 64 and IA-32 Architectures Software Developer's Manuals"

Intel, 2009

Available: http://www.intel.com/products/processor/manuals

In particular, pay attention to "Volume 3A: System Programming Guide Part 1" and "Volume 3B:

System Programming Guide Part 2"

[PS81] "RISC-I: A Reduced Instruction Set VLSI Computer"

D.A. Patterson and C.H. Sequin

ISCA '81, Minneapolis, May 1981

The paper that introduced the term RISC, and started the avalanche of research into simplifying computer chips for performance.

[SB92] "CPU Performance Evaluation and Execution Time Prediction

Using Narrow Spectrum Benchmarking"

Rafael H. Saavedra-Barrera

EECS Department, University of California, Berkeley

Technical Report No. UCB/CSD-92-684, February 1992 www.eecs.berkeley.edu/Pubs/TechRpts/1992/CSD-92-684.pdf

A great dissertation about how to predict execution time of applications by breaking them down into constituent pieces and knowing the cost of each piece. Probably the most interesting part that comes out of this work is the tool to measure details of the cache hierarchy (described in Chapter 5). Make sure to check out the wonderful diagrams therein.

[W03] "A Survey on the Interaction Between Caching, Translation and Protection"

Adam Wiggins

University of New South Wales TR UNSW-CSE-TR-0321, August, 2003

An excellent survey of how TLBs interact with other parts of the CPU pipeline, namely hardware caches.

[WG00] "The SPARC Architecture Manual: Version 9"

David L. Weaver and Tom Germond, September 2000

SPARC International, San Jose, California

Available: http://www.sparc.org/standards/SPARCV9.pdf

In this homework, you are to measure the size and cost of accessing a TLB. The idea is based on work by Saavedra-Barrera [SB92], who developed a simple but beautiful method to measure numerous aspects of cache hierarchies, all with a very simple user-level program. Read his work for more details.

The basic idea is to access some number of pages within a large data structure (e.g., an array) and to time those accesses. For example, let's say the TLB size of a machine happens to be 4 (which would be very small, but useful for the purposes of this discussion). If you write a program that touches 4 or fewer pages, each access should be a TLB hit, and thus relatively fast. However, once you touch 5 pages or more, repeatedly in a loop, each access will suddenly jump in cost, to that of a TLB miss.

The basic code to loop through an array once should look like this:

int jump = PAGESIZE / sizeof(int);
for (i = 0; i < NUMPAGES * jump; i += jump) {
a[i] += 1;
}
In this loop, one integer per page of the arrayais updated, up to the number of pages specified byNUMPAGES. By timing such a loop repeatedly (say, a few hundred million times in another loop around this one, or however many loops are needed to run for a few seconds), you can time how long each access takes (on average). By looking for jumps in cost as

NUMPAGESincreases, you can roughly determine how big the first-level

TLB is, determine whether a second-level TLB exists (and how big it is if it does), and in general get a good sense of how TLB hits and misses can affect performance.

Figure 19.5 (page 16) shows the average time per access as the number of pages accessed in the loop is increased. As you can see in the graph, when just a few pages are accessed (8 or fewer), the average access time is roughly 5 nanoseconds. When 16 or more pages are accessed, there is a sudden jump to about 20 nanoseconds per access. A final jump in cost occurs at around 1024 pages, at which point each access takes around 70 nanoseconds. From this data, we can conclude that there is a two-level

TLB hierarchy; the first is quite small (probably holding between 8 and 16 entries); the second is larger but slower (holding roughly 512 entries).

The overall difference between hits in the first-level TLB and misses is quite large, roughly a factor of fourteen. TLB performance matters!

TLB Size Measurement 80 60 40

20 Time Per Access (ns)

0 1 4 16 64 256 1024

Number Of Pages

Figure 19.5:Discovering TLB Sizes and Miss Costs

Questions

  1. For timing, you'll need to use a timer such as that made available

bygettimeofday(). How precise is such a timer? How long does an operation have to take in order for you to time it precisely?

(this will help determine how many times, in a loop, you'll have to repeat a page access in order to time it successfully)

  1. Write the program, calledtlb.c, that can roughly measure the cost

of accessing each page. Inputs to the program should be: the number of pages to touch and the number of trials.

  1. Now write a script in your favorite scripting language (csh, python,

etc.) to run this program, while varying the number of pages accessed from 1 up to a few thousand, perhaps incrementing by a factor of two per iteration. Run the script on different machines and gather some data. How many trials are needed to get reliable measurements?

  1. Next, graph the results, making a graph that looks similar to the

one above. Use a good tool likeploticus. Visualization usually makes the data much easier to digest; why do you think that is?

  1. One thing to watch out for is compiler optimzation. Compilers do

all sorts of clever things, including removing loops which increment values that no other part of the program subsequently uses.

How can you ensure the compiler does not remove the main loop above from your TLB size estimator?

  1. Another thing to watch out for is the fact that most systems today

ship with multiple CPUs, and each CPU, of course, has its own TLB hierarchy. To really get good measurements, you have to run your code on just one CPU, instead of letting the scheduler bounce it from one CPU to the next. How can you do that? (hint: look up "pinning a thread" on Google for some clues) What will happen if you don't do this, and the code moves from one CPU to the other?

  1. Another issue that might arise relates to initialization. If you don't

initialize the array aabove before accessing it, the first time you access it will be very expensive, due to initial access costs such as demand zeroing. Will this affect your code and its timing? What can you do to counterbalance these potential costs?