Chapter 10: Don't Forget Synchronization

This page is a generated reference surface for selective reading. It exists to keep the learner apps guide-first while still preserving source access.

Learning objectives

Explain the main ideas and vocabulary in Don't Forget Synchronization.
Work through the source examples for Don't Forget Synchronization without depending on raw chunk order.
Use Don't Forget Synchronization as selective reference when learner modules point back to Ostep.

Prerequisites

Earlier prerequisite concepts leading into Chapter 10: Don't Forget Synchronization.

Module targets

module-01-processes-scheduling
module-04-file-systems-io

AI companion modes

Explain simply
Socratic tutor
Quiz me
Challenge my understanding
Diagnose my confusion
Generate extra practice
Revision mode
Connect forward / backward

Source-of-truth note

This unit is anchored to Ostep and the source chapter "Chapter 10: Don't Forget Synchronization". Use external resources only to clarify, extend, or modernize details without replacing the chapter's conceptual spine.

External enrichment

No chapter-specific enrichment resources are curated yet. Add them in the unit manifest when a source clearly improves learning.

Source provenance

Primary source: Ostep
Source chapter 10: Chapter 10: Don't Forget Synchronization
Raw source file: 049-10-2-don-t-forget-synchronization.md
Raw source file: 050-10-4-single-queue-scheduling.md
Raw source file: 051-10-5-multi-queue-scheduling.md

Merged source

Don't Forget Synchronization

10.2 Don't Forget Synchronization

Given that the caches do all of this work to provide coherence, do programs (or the OS itself) have to worry about anything when they access shared data? The answer, unfortunately, is yes, and is documented in great detail in the second piece of this book on the topic of concurrency.

While we won't get into the details here, we'll sketch/review some of the basic ideas here (assuming you're familiar with concurrency).

When accessing (and in particular, updating) shared data items or structures across CPUs, mutual exclusion primitives (such as locks) should likely be used to guarantee correctness (other approaches, such as building lock-free data structures, are complex and only used on occasion;

see the chapter on deadlock in the piece on concurrency for details). For

example, assume we have a shared queue being accessed on multiple

CPUs concurrently. Without locks, adding or removing elements from the queue concurrently will not work as expected, even with the underlying coherence protocols; one needs locks to atomically update the data structure to its new state.

To make this more concrete, imagine this code sequence, which is used to remove an element from a shared linked list, as we see in Figure 10.3.

Imagine if threads on two CPUs enter this routine at the same time. If

Thread 1 executes the first line, it will have the current value of head stored in itstmpvariable; if Thread 2 then executes the first line as well, it also will have the same value ofheadstored in its own privatetmp variable (tmpis allocated on the stack, and thus each thread will have its own private storage for it). Thus, instead of each thread removing an element from the head of the list, each thread will try to remove the

struct __Node_t *next;
} Node_t;
int List_Pop() {
Node_t *tmp = head; // remember old head ...
int value = head->value; // ... and its value
head = head->next; // advance head to next pointer
free(tmp); // free old head
return value; // return value at head
}

Figure 10.3:Simple List Delete Code same head element, leading to all sorts of problems (such as an attempted double free of the head element at line 4, as well as potentially returning the same data value twice).

The solution, of course, is to make such routines correct via locking. In this case, allocating a simple mutex (e.g., pthreadmutext m;) and then adding alock(&m)at the beginning of the routine and anunlock(&m)at the end will solve the problem, ensuring that the code will execute as desired. Unfortunately, as we will see, such an approach is not without problems, in particular with regards to performance. Specifically, as the number of CPUs grows, access to a synchronized shared data structure becomes quite slow.

10.3 One Final Issue: Cache Affinity

One final issue arises in building a multiprocessor cache scheduler, known ascache affinity. This notion is simple: a process, when run on a particular CPU, builds up a fair bit of state in the caches (and TLBs) of the

CPU. The next time the process runs, it is often advantageous to run it on the same CPU, as it will run faster if some of its state is already present in the caches on that CPU. If, instead, one runs a process on a different CPU each time, the performance of the process will be worse, as it will have to reload the state each time it runs (note it will run correctly on a different

CPU thanks to the cache coherence protocols of the hardware). Thus, a multiprocessor scheduler should consider cache affinity when making its scheduling decisions, perhaps preferring to keep a process on the same

CPU if at all possible.

Single Queue Scheduling

10.4 Single-Queue Scheduling

With this background in place, we now discuss how to build a scheduler for a multiprocessor system. The most basic approach is to simply reuse the basic framework for single processor scheduling, by putting all jobs that need to be scheduled into a single queue; we call thissinglequeue multiprocessor schedulingor SQMSfor short. This approach has the advantage of simplicity; it does not require much work to take an existing policy that picks the best job to run next and adapt it to work on more than one CPU (where it might pick the best two jobs to run, if there are two CPUs, for example).

However, SQMS has obvious shortcomings. The first problem is a lack ofscalability. To ensure the scheduler works correctly on multiple CPUs, the developers will have inserted some form oflockinginto the code, as described above. Locks ensure that when SQMS code accesses the single queue (say, to find the next job to run), the proper outcome arises.

Locks, unfortunately, can greatly reduce performance, particularly as the number of CPUs in the systems grows [A91]. As contention for such a single lock increases, the system spends more and more time in lock overhead and less time doing the work the system should be doing (note:

it would be great to include a real measurement of this in here someday).

The second main problem with SQMS is cache affinity. For example, let us assume we have five jobs to run (A,B,C,D,E) and four processors.

Our scheduling queue thus looks like this:

Queue A B C D E NULL

Over time, assuming each job runs for a time slice and then another job is chosen, here is a possible job schedule across CPUs:

CPU 0 A E D C B ... (repeat) ...

CPU 1 B A E D C ... (repeat) ...

CPU 2 C B A E D ... (repeat) ...

CPU 3 D C B A E ... (repeat) ...

Because each CPU simply picks the next job to run from the globallyshared queue, each job ends up bouncing around from CPU to CPU, thus doing exactly the opposite of what would make sense from the standpoint of cache affinity.

To handle this problem, most SQMS schedulers include some kind of affinity mechanism to try to make it more likely that process will continue to run on the same CPU if possible. Specifically, one might provide affinity for some jobs, but move others around to balance load. For example, imagine the same five jobs scheduled as follows:

CPU 0 A E A A A ... (repeat) ...

CPU 1 B B E B B ... (repeat) ...

CPU 2 C C C E C ... (repeat) ...

CPU 3 D D D D E ... (repeat) ...

In this arrangement, jobsAthroughDare not moved across processors, with only jobEmigratingfrom CPU to CPU, thus preserving affinity for most. You could then decide to migrate a different job the next time through, thus achieving some kind of affinity fairness as well. Implementing such a scheme, however, can be complex.

Thus, we can see the SQMS approach has its strengths and weaknesses. It is straightforward to implement given an existing single-CPU scheduler, which by definition has only a single queue. However, it does not scale well (due to synchronization overheads), and it does not readily preserve cache affinity.

Multi Queue Scheduling

10.5 Multi-Queue Scheduling

Because of the problems caused in single-queue schedulers, some systems opt for multiple queues, e.g., one per CPU. We call this approach multi-queue multiprocessor scheduling(orMQMS).

In MQMS, our basic scheduling framework consists of multiple scheduling queues. Each queue will likely follow a particular scheduling discipline, such as round robin, though of course any algorithm can be used.

When a job enters the system, it is placed on exactly one scheduling queue, according to some heuristic (e.g., random, or picking one with fewer jobs than others). Then it is scheduled essentially independently, thus avoiding the problems of information sharing and synchronization found in the single-queue approach.

For example, assume we have a system where there are just two CPUs (labeled CPU 0 and CPU 1), and some number of jobs enter the system:

A,B,C, andDfor example. Given that each CPU has a scheduling queue now, the OS has to decide into which queue to place each job. It might do something like this:

Q0 A C Q1 B D

Depending on the queue scheduling policy, each CPU now has two jobs to choose from when deciding what should run. For example, with round robin, the system might produce a schedule that looks like this:

CPU 0 A A C C A A C C A A C C ...

CPU 1 B B D D B B D D B B D D ...

MQMS has a distinct advantage of SQMS in that it should be inherently more scalable. As the number of CPUs grows, so too does the number of queues, and thus lock and cache contention should not become a central problem. In addition, MQMS intrinsically provides cache affinity; jobs stay on the same CPU and thus reap the advantage of reusing cached contents therein.

But, if you've been paying attention, you might see that we have a new problem, which is fundamental in the multi-queue based approach:load imbalance. Let's assume we have the same set up as above (four jobs, two CPUs), but then one of the jobs (sayC) finishes. We now have the following scheduling queues:

Q0 A Q1 B D

If we then run our round-robin policy on each queue of the system, we will see this resulting schedule:

CPU 0 A A A A A A A A A A A A ...

CPU 1 B B D D B B D D B B D D ...

As you can see from this diagram,Agets twice as much CPU asBand

D, which is not the desired outcome. Even worse, let's imagine that both

AandCfinish, leaving just jobsBandDin the system. The scheduling queues will look like this:

Q0 Q1 B D

As a result, CPU 0 will be left idle! (insert dramatic and sinister music here)

And hence our CPU usage timeline looks sad:

CPU 0

CPU 1 B B D D B B D D B B D D ...

So what should a poor multi-queue multiprocessor scheduler do? How can we overcome the insidious problem of load imbalance and defeat the evil forces of ... the Decepticons1? How do we stop asking questions that are hardly relevant to this otherwise wonderful book?

1Little known fact is that the home planet of Cybertron was destroyed by bad CPU scheduling decisions. And now let that be the first and last reference to Transformers in this book, for which we sincerely apologize.

CRUX: HOWTODEALWITHLOADIMBALANCE

How should a multi-queue multiprocessor scheduler handle load imbalance, so as to better achieve its desired scheduling goals?

The obvious answer to this query is to move jobs around, a technique which we (once again) refer to asmigration. By migrating a job from one

CPU to another, true load balance can be achieved.

Let's look at a couple of examples to add some clarity. Once again, we have a situation where one CPU is idle and the other has some jobs.

Q0 Q1 B D

In this case, the desired migration is easy to understand: the OS should simply move one ofBorDto CPU 0. The result of this single job migration is evenly balanced load and everyone is happy.

A more tricky case arises in our earlier example, where Awas left alone on CPU 0 andBandDwere alternating on CPU 1:

Q0 A Q1 B D

In this case, a single migration does not solve the problem. What would you do in this case? The answer, alas, is continuous migration of one or more jobs. One possible solution is to keep switching jobs, as we see in the following timeline. In the figure, firstAis alone on CPU 0, andBandDalternate on CPU 1. After a few time slices,Bis moved to compete withAon CPU 0, whileDenjoys a few time slices alone on CPU

And thus load is balanced:

CPU 0 A A A A B A B A B B B B ...

CPU 1 B D B D D D D D A D A D ...

Of course, many other possible migration patterns exist. But now for the tricky part: how should the system decide to enact such a migration?

One basic approach is to use a technique known as work stealing

[FLR98]. With a work-stealing approach, a (source) queue that is low on jobs will occasionally peek at another (target) queue, to see how full it is. If the target queue is (notably) more full than the source queue, the source will "steal" one or more jobs from the target to help balance load.

Of course, there is a natural tension in such an approach. If you look around at other queues too often, you will suffer from high overhead and have trouble scaling, which was the entire purpose of implementing the multiple queue scheduling in the first place! If, on the other hand, you don't look at other queues very often, you are in danger of suffering from severe load imbalances. Finding the right threshold remains, as is common in system policy design, a black art.

10.6 Linux Multiprocessor Schedulers

Interestingly, in the Linux community, no common solution has approached to building a multiprocessor scheduler. Over time, three different schedulers arose: the O(1) scheduler, the Completely Fair Scheduler (CFS), and the BF Scheduler (BFS)2. See Meehean's dissertation for an excellent overview of the strengths and weaknesses of said schedulers

[M11]; here we just summarize a few of the basics.

Both O(1) and CFS use multiple queues, whereas BFS uses a single queue, showing that both approaches can be successful. Of course, there are many other details which separate these schedulers. For example, the

O(1) scheduler is a priority-based scheduler (similar to the MLFQ discussed before), changing a process's priority over time and then scheduling those with highest priority in order to meet various scheduling objectives; interactivity is a particular focus. CFS, in contrast, is a deterministic proportional-share approach (more like Stride scheduling, as discussed earlier). BFS, the only single-queue approach among the three, is also proportional-share, but based on a more complicated scheme known as

Earliest Eligible Virtual Deadline First (EEVDF) [SA96]. Read more about these modern algorithms on your own; you should be able to understand how they work now!

Learning objectives​

Prerequisites​

Module targets​

AI companion modes​

Source-of-truth note​

External enrichment​

Source provenance​

Merged source​

Don't Forget Synchronization​

10.2 Don't Forget Synchronization

10.3 One Final Issue: Cache Affinity​

Single Queue Scheduling​

10.4 Single-Queue Scheduling

Multi Queue Scheduling​

10.5 Multi-Queue Scheduling

10.6 Linux Multiprocessor Schedulers​

Learning objectives

Prerequisites

Module targets

AI companion modes

Source-of-truth note

External enrichment

Source provenance

Merged source

Don't Forget Synchronization

10.3 One Final Issue: Cache Affinity

Single Queue Scheduling

Multi Queue Scheduling

10.6 Linux Multiprocessor Schedulers