Concurrency and Debugging Clinic

Retrieval Prompts

  1. List what threads in a process share and what each has its own copy of.
  2. Write the canonical pthread_cond_wait pattern with mutex, predicate, and the while loop.
  3. Name two races the mutex prevents in a bounded-buffer queue and name one race the condition variable prevents.
  4. State, in one sentence, when atomic is sufficient and when you must still take a mutex.
  5. Describe what gdb watch does and when you would reach for it over a breakpoint.

Compare and Distinguish

  • mutex vs atomic
  • signal vs broadcast
  • breakpoint vs watchpoint
  • perf record vs strace
  • "spurious wake-up" vs "lost wake-up"

Common Mistake Check

  1. Writing if (empty) cond_wait(...) instead of while (empty) cond_wait(...).
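  2. Using a plain or volatile int done as a cross-thread stop flag instead of an atomic. A plain flag can be hoisted out of the loop so the worker spins forever, and volatile still gives no atomicity or ordering guarantee.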
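  3. Changing the predicate without holding the mutex and then signalling, so the waiter misses the wake-up. (Signalling after the unlock is legal; the lost wake-up comes from updating the shared state outside the lock.)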
  4. Passing &local to pthread_create and returning from the enclosing function before the thread reads it.
  5. Attempting to debug an -O3 build with gdb and being surprised that half the locals are "value optimized out."
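
For mistake 2, a minimal sketch of a stop flag that uses C11 atomics rather than volatile may help. The names worker, done, iterations, and run_and_stop_worker are hypothetical and chosen only for illustration; the sketch assumes a C11 compiler with <stdatomic.h> and a POSIX threads library.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool done;        /* hypothetical stop flag, zero-initialized to false */
static atomic_long iterations;  /* counts loop passes so the caller can see progress */

static void *worker(void *arg) {
    (void)arg;
    /* atomic_load forces a fresh, race-free read on every pass. */
    while (!atomic_load(&done))
        atomic_fetch_add(&iterations, 1);
    return NULL;
}

/* Starts the worker, waits until it has looped at least once, then stops it.
 * Returns 0 when the worker was created and joined cleanly. */
int run_and_stop_worker(void) {
    pthread_t t;
    atomic_store(&done, false);
    atomic_store(&iterations, 0);
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    while (atomic_load(&iterations) == 0)
        sched_yield();          /* let the worker run */
    atomic_store(&done, true);  /* the worker is guaranteed to observe this */
    return pthread_join(t, NULL);
}
```

Calling run_and_stop_worker() from main should return promptly at any optimization level, which is the property a plain int flag does not give you.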

Mini Application: Producer-Consumer From Memory

Write, with no references, a producer-consumer program where:

  1. One producer emits the integers 1..10000.
  2. One consumer sums them.
  3. The queue capacity is 8.
  4. The program exits by having the producer push a sentinel (e.g., -1) after the last value.
  5. Final sum is printed and equals 50005000.

Walk through your own source and annotate each line of q_put and q_get with the race it prevents (mutex vs cond-wait vs signal).
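
For checking your attempt afterwards, here is one possible sketch of the exercise, not a reference solution. Only q_put and q_get are named in the text above; queue_t, CAP, SENTINEL, producer, consumer, and run_pipeline are assumptions made for illustration. The comments show the kind of per-line annotation the task asks for.

```c
#include <pthread.h>

#define CAP 8
#define SENTINEL (-1)

typedef struct {
    int buf[CAP];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
} queue_t;

static queue_t q = {
    .mu = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

static void q_put(queue_t *qp, int v) {
    pthread_mutex_lock(&qp->mu);                   /* mutex: no torn update of tail/count */
    while (qp->count == CAP)                       /* while: re-check after a spurious wake-up */
        pthread_cond_wait(&qp->not_full, &qp->mu); /* cond-wait: unlock and sleep atomically */
    qp->buf[qp->tail] = v;
    qp->tail = (qp->tail + 1) % CAP;
    qp->count++;
    pthread_cond_signal(&qp->not_empty);           /* signal: wake a consumer blocked on empty */
    pthread_mutex_unlock(&qp->mu);
}

static int q_get(queue_t *qp) {
    pthread_mutex_lock(&qp->mu);                   /* mutex: no torn update of head/count */
    while (qp->count == 0)                         /* while: re-check after a spurious wake-up */
        pthread_cond_wait(&qp->not_empty, &qp->mu);
    int v = qp->buf[qp->head];
    qp->head = (qp->head + 1) % CAP;
    qp->count--;
    pthread_cond_signal(&qp->not_full);            /* signal: wake a producer blocked on full */
    pthread_mutex_unlock(&qp->mu);
    return v;
}

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 10000; i++)
        q_put(&q, i);
    q_put(&q, SENTINEL);                           /* tells the consumer to stop */
    return NULL;
}

static void *consumer(void *arg) {
    long *sum = arg;
    for (;;) {
        int v = q_get(&q);
        if (v == SENTINEL)
            break;
        *sum += v;
    }
    return NULL;
}

/* Runs one producer and one consumer to completion and returns the sum. */
long run_pipeline(void) {
    pthread_t p, c;
    long sum = 0;  /* safe to pass by address: both threads are joined before returning */
    pthread_create(&c, NULL, consumer, &sum);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return sum;
}
```

Printing the result of run_pipeline() from main with "%ld\n" should give 50005000. Error checking on the pthread calls is omitted to keep the sketch short.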

Mini Application: Debug a Planted Race

Start from this deliberately buggy increment program:

#include <pthread.h>
#include <stdio.h>

long counter = 0;

void *bump(void *_) {
    for (int i = 0; i < 1000000; i++) counter++;
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, bump, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", counter); /* expected 4000000, actually less */
}

Tasks:

  1. Run it 20 times. Record the distribution of final values.
  2. Fix with atomic_long. Re-run and confirm 4000000 every time.
  3. Fix again with a pthread_mutex_t (revert the atomic). Measure the runtime difference.
  4. In a paragraph, explain which fix you would use in a real server that increments a metrics counter 10 M times per second.
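
As a starting point for tasks 2 and 3, the two fixes might be sketched side by side as below. The helper names (bump_atomic, bump_mutex, run_threads, run_atomic, run_mutex) are hypothetical, and the use of memory_order_relaxed is an assumption that fits a pure counter whose final value is read only after pthread_join.

```c
#include <pthread.h>
#include <stdatomic.h>

#define THREADS 4
#define ITERS 1000000

static atomic_long a_counter;  /* fix 1: atomic counter */
static long m_counter;         /* fix 2: plain counter guarded by m_lock */
static pthread_mutex_t m_lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump_atomic(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        /* One indivisible read-modify-write; relaxed ordering is enough
         * because nothing else is published through this counter. */
        atomic_fetch_add_explicit(&a_counter, 1, memory_order_relaxed);
    return NULL;
}

static void *bump_mutex(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&m_lock);
        m_counter++;           /* load, add, store all happen under the lock */
        pthread_mutex_unlock(&m_lock);
    }
    return NULL;
}

/* Runs THREADS copies of fn and waits for all of them. */
static void run_threads(void *(*fn)(void *)) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, fn, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
}

long run_atomic(void) {
    atomic_store(&a_counter, 0);
    run_threads(bump_atomic);
    return atomic_load(&a_counter);
}

long run_mutex(void) {
    m_counter = 0;
    run_threads(bump_mutex);
    return m_counter;          /* safe: every writer has been joined */
}
```

Timing run_atomic() against run_mutex() (for example with clock_gettime around each call) gives the runtime difference that task 3 asks you to measure.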

Mini Application: Core-Dump Autopsy

Given this program:

#include <string.h>

void crash(char *dst, const char *src) { strcpy(dst, src); }

int main(void) {
    char buf[8];
    crash(buf, "this is way too long");
    return 0;
}

Tasks:

  1. Enable core dumps (ulimit -c unlimited), build with -g -O0, and run.
  2. Open the core with gdb ./crash core and produce a full backtrace.
  3. In the main frame, print buf and sizeof(buf). Explain the discrepancy between the 8-byte array and the string that was copied into it.
  4. Rebuild with -fsanitize=address and rerun. Quote the first four lines of the ASan report and point at the line number that caused the overflow.
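
Once the autopsy has located the overflow, one possible repair is a copy that checks the destination size first. The function copy_checked below is a hypothetical helper, not part of the original program; it is a sketch of the idea rather than the only acceptable fix.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst only if it fits, including the terminating NUL.
 * Returns 0 on success, or -1 (leaving dst untouched) if dst is too small. */
int copy_checked(char *dst, size_t dst_size, const char *src) {
    size_t need = strlen(src) + 1;  /* +1 for the NUL terminator */
    if (need > dst_size)
        return -1;
    memcpy(dst, src, need);
    return 0;
}
```

With char buf[8], calling copy_checked(buf, sizeof buf, "this is way too long") returns -1, because the string needs 21 bytes including its terminator while sizeof buf is 8.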

Mini Application: strace the Hang

Take any program that uses a mutex. Deliberately omit a pthread_mutex_unlock so that it deadlocks. While it hangs, attach to it with strace -f -p <pid>. Identify the line of output that shows a thread blocked in a futex(..., FUTEX_WAIT...) call. Restore the unlock and re-run.
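
If you have no program to hand, the forgotten unlock might look like the sketch below, where an early return skips it. All names (acct_lock, balance, withdraw, probe, leaked_lock_status) are hypothetical. The probe thread uses pthread_mutex_trylock so the demonstration reports the leaked lock instead of hanging; swapping it for pthread_mutex_lock should produce the hang to inspect with strace.

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t acct_lock = PTHREAD_MUTEX_INITIALIZER;
static int balance = 0;

/* BUG (deliberate): the early return leaves acct_lock held. */
static int withdraw(int amount) {
    pthread_mutex_lock(&acct_lock);
    if (amount > balance)
        return -1;              /* forgot pthread_mutex_unlock(&acct_lock) */
    balance -= amount;
    pthread_mutex_unlock(&acct_lock);
    return 0;
}

/* Tries to take the lock without blocking and records the result. */
static void *probe(void *arg) {
    int *rc = arg;
    *rc = pthread_mutex_trylock(&acct_lock);
    if (*rc == 0)
        pthread_mutex_unlock(&acct_lock);
    return NULL;
}

/* Triggers the buggy path, then asks a second thread whether the lock is free.
 * Returns the trylock result seen by that thread. */
int leaked_lock_status(void) {
    int rc = 0;
    pthread_t t;
    withdraw(100);              /* balance is 0, so this takes the early return */
    pthread_create(&t, NULL, probe, &rc);
    pthread_join(t, NULL);
    return rc;
}
```

Here leaked_lock_status() returns EBUSY, confirming the mutex is still held after withdraw has returned.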

Scenarios

  1. A multi-threaded web cache sometimes returns the wrong URL's body to the client. helgrind flags one unprotected read of cache[url]. Why does this matter if writes are protected?
  2. A queue uses cond_signal per put and one consumer. Throughput is fine. A team adds three more consumers; throughput collapses. Diagnose.
  3. A program is correct under -O0 and wrong under -O2. The symptom is a stale read of a shared flag. What is the root cause and the fix?
  4. A gdb session shows counter = 0 repeatedly, but the program prints counter = 4000000. The program is multi-threaded. Why might gdb be showing a different thread's local copy?
  5. perf record -g on a lock-heavy workload shows 60% of CPU in __lll_lock_wake. What does that tell you, and what are your three likely fixes?

Evidence Check

Complete when: your producer-consumer runs with 4 producers and 4 consumers without losing or duplicating any item, both planted-race fixes print 4000000 on all 20 runs, and you can read a core dump and name the offending line in under two minutes.