Atomics and Memory Ordering at the C Level
What This Concept Is
C11 standardized atomics in <stdatomic.h>. An atomic operation is one that either happens entirely or not at all -- no other thread can observe it half-done.
Two orthogonal ideas:
- Atomicity. The read-modify-write of a shared counter (*p += 1) is not atomic in general. atomic_fetch_add(p, 1) is atomic: the whole load-add-store is one indivisible operation the hardware supports via a LOCK prefix, an ldxr/stxr pair, or similar.
- Memory ordering. Modern CPUs and compilers reorder instructions aggressively for speed. Memory ordering specifies which reorderings are forbidden as observed by other threads.
The orderings, from strongest to weakest:
| Order | What it guarantees |
|---|---|
| memory_order_seq_cst (default) | Acquire/release semantics plus a single total order over all seq_cst operations that every thread agrees on. |
| memory_order_acq_rel | For read-modify-write operations: acquire on the load half, release on the store half. Used when implementing locks. |
| memory_order_acquire | Memory accesses after this load cannot be reordered before it. |
| memory_order_release | Memory accesses before this store cannot be reordered after it. |
| memory_order_relaxed | Atomicity of the value only. No ordering guarantees relative to other memory. |
The default (seq_cst) is the easiest to reason about. The weaker orderings can give measurable speedups on weakly ordered architectures such as ARM and PowerPC, but they require careful reasoning about the invariant you are relying on.
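To make "default" concrete, here is a minimal sketch of the two spellings C11 offers for each operation. The function name default_and_explicit_forms is made up for illustration; the atomic_* calls are the standard ones from <stdatomic.h>.

```c
#include <stdatomic.h>

static atomic_int x;   /* static storage: starts at 0 */

/* Hypothetical helper: exercises the plain and _explicit forms on one variable. */
static int default_and_explicit_forms(void) {
    /* The plain form always means memory_order_seq_cst. */
    atomic_store(&x, 1);

    /* The _explicit form names the ordering; this line requests the same
       ordering as the plain store above. */
    atomic_store_explicit(&x, 2, memory_order_seq_cst);

    /* A relaxed read-modify-write is still atomic; it only drops ordering
       guarantees relative to other memory. */
    atomic_fetch_add_explicit(&x, 1, memory_order_relaxed);

    return atomic_load(&x);   /* plain load: seq_cst */
}
```

Called from a single thread, the function returns 3. The ordering argument changes what other threads may observe about surrounding memory, never the value of x itself.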
Why It Matters Here
Atomics let you build lock-free data structures (queues, counters, flags) and they are the mechanism underneath mutexes and condition variables themselves. Equally important: they are often not what you need. A while (!done) { ... } loop where done is not atomic can spin forever even after another thread writes done = 1, because the compiler is allowed to hoist the read out of the loop. The fix is atomic_bool done, not "just a volatile" -- volatile is for memory-mapped hardware registers, not threads.
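A minimal sketch of the done flag done correctly, assuming POSIX threads. The names spinner and run_done_flag_demo are invented for this example. Because done is an atomic_bool, the compiler must perform the load on every iteration and cannot hoist it out of the loop.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool done;   /* static storage: starts out false */

static void *spinner(void *arg) {
    (void)arg;
    /* Each iteration performs a real atomic load of done. */
    while (!atomic_load(&done)) {
        /* spin */
    }
    return NULL;
}

/* Hypothetical driver: returns 0 once the spinner thread has exited,
   -1 if the thread could not be created. */
static int run_done_flag_demo(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, spinner, NULL) != 0)
        return -1;
    atomic_store(&done, true);   /* seq_cst by default */
    pthread_join(t, NULL);       /* returns only after the loop has ended */
    return 0;
}
```

Build with -pthread. If done were a plain bool, the same loop would contain a data race and the compiler would be free to read it once and spin forever.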
Concrete Example
A shared counter, two ways:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_long counter_atomic = 0;
long counter_plain = 0;                          /* unsynchronized on purpose */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* for a mutex-protected variant */

static void *worker_atomic(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add_explicit(&counter_atomic, 1, memory_order_relaxed);
    return NULL;
}

static void *worker_plain(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter_plain++;                         /* racy: load, add, store */
    return NULL;
}
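Two threads running worker_atomic end up with counter_atomic == 2000000 every time. Two threads running worker_plain typically end up with an unpredictable number at or below 2000000, because counter_plain++ is a separate load, add, and store, and the threads interleave so that increments are lost. Formally the unsynchronized access is a data race, which is undefined behavior in C11. With optimization enabled the compiler may collapse the loop into a single addition, which can hide the bug, so build with -O0 to observe the lost updates.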
A release/acquire handshake -- the pattern that replaces a mutex for a single flag:
atomic_int ready = 0;
int payload;
/* producer */
payload = 42;
atomic_store_explicit(&ready, 1, memory_order_release);
/* consumer */
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {}
printf("%d\n", payload); /* guaranteed to be 42 */
The release store guarantees that everything the producer wrote before it, including payload = 42, is visible to any thread whose acquire load reads the value 1 from ready. With relaxed ordering on either side, the consumer could see ready == 1 but still read a stale payload.
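The fragment above can be wrapped into two threads as follows. This is a sketch assuming POSIX threads; producer, consumer, and run_handshake are names chosen for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int ready;   /* 0 until the payload has been published */
static int payload;        /* plain int, protected only by the handshake */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* write the data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then publish it */
    return NULL;
}

static void *consumer(void *arg) {
    int *seen = arg;
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        /* spin until the release store is observed */
    }
    *seen = payload;   /* ordered after the acquire load that read 1 */
    return NULL;
}

/* Hypothetical driver: returns the payload value the consumer observed,
   or -1 if a thread could not be created. */
static int run_handshake(void) {
    pthread_t p, c;
    int seen = -1;
    if (pthread_create(&p, NULL, producer, NULL) != 0)
        return -1;
    if (pthread_create(&c, NULL, consumer, &seen) != 0) {
        pthread_join(p, NULL);
        return -1;
    }
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;
}
```

run_handshake() returns 42. The plain payload variable needs no atomic type because the release/acquire pair orders its single write before its single read.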
Common Confusion / Misconception
"Atomics replace mutexes." They replace mutexes for simple shared state: counters, flags, single-pointer handoffs. As soon as you need to update two fields consistently (e.g., head and size of a queue), a mutex is almost always clearer and faster than any lock-free design you will write without specialized training.
"volatile makes my variable thread-safe." It does not. volatile tells the compiler not to cache the value in a register across reads or reorder accesses with other volatile accesses. It says nothing about ordering relative to non-volatile memory, and it does not emit the memory fences required on multicore CPUs. Use _Atomic/std::atomic for thread communication.
"memory_order_relaxed is faster so I should use it." Only if you understand you are giving up all ordering relative to other memory. A relaxed store of a done flag can be visible to another thread before the payload the flag was supposed to signal. The payload read can then tear or return garbage. Default to seq_cst; weaken only with a specific reason and a microbenchmark.
How To Use It
Practical guidance:
- Default to mutexes for anything non-trivial. They are cheap on uncontended paths and correct on complex invariants.
- Use atomics for (a) counters, (b) single-writer/many-reader flags, (c) single-producer/single-consumer handoffs.
- Leave the ordering at seq_cst unless a profiler shows contention and you can name the invariant the weaker order preserves.
- Never use volatile for thread communication.
- Prefer atomic_flag with test_and_set/clear for the absolute minimum primitive (a spinlock).
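The last item can be sketched as follows, assuming POSIX threads and the default seq_cst ordering. The names spin_lock, spin_unlock, spin_worker, and run_spinlock_demo are invented for this example; atomic_flag_test_and_set and atomic_flag_clear are the standard C11 calls.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;   /* clear means unlocked */
static long guarded_counter;                  /* plain long, protected by spin */

static void spin_lock(void) {
    /* test_and_set returns the previous value: true means the lock was
       already held, so keep trying. */
    while (atomic_flag_test_and_set(&spin)) {
        /* busy-wait */
    }
}

static void spin_unlock(void) {
    atomic_flag_clear(&spin);
}

static void *spin_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        guarded_counter++;   /* safe: only one thread is in here at a time */
        spin_unlock();
    }
    return NULL;
}

/* Hypothetical driver: two workers, returns the final counter value,
   or -1 if a thread could not be created. */
static long run_spinlock_demo(void) {
    pthread_t a, b;
    guarded_counter = 0;
    if (pthread_create(&a, NULL, spin_worker, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, spin_worker, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return guarded_counter;
}
```

run_spinlock_demo() returns 200000. A spinlock burns CPU while waiting, so this is a teaching primitive; a pthread mutex is usually the better choice when the critical section can block or contention is high.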
Check Yourself
- Why is counter++ in two threads sometimes wrong even when counter is volatile?
- What does memory_order_release on a store guarantee about earlier writes?
- In which situations is "just use a mutex" the right answer?
Mini Drill or Application
Do all four:
- Compile both worker functions above. Run them 20 times each. Record the distribution of counter_plain.
- Add a mutex-protected version of the plain worker. Compare runtimes vs the atomic version at high contention (e.g., 8 threads × 1,000,000 iterations).
- Implement the ready/payload handshake and measure the latency in nanoseconds with clock_gettime(CLOCK_MONOTONIC, ...).
- In one sentence, explain why memory_order_relaxed on a counter is safe but on a handoff flag is not.
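For the timing parts of the drill, a small helper like the one below may be useful. It is a sketch: now_ns is a made-up name, and the feature-test macro is only needed when compiling in a strict ISO mode such as -std=c11.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std=c11 */
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current CLOCK_MONOTONIC reading in nanoseconds. */
static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Take one reading before the code under test and one after, then subtract. A single handoff is short enough that the timer overhead matters, so time many iterations and divide.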
Read This Only If Stuck
- COD 2.11: Parallelism and Synchronization (Part 2)
- COD 5.8: Parallelism and Memory Hierarchies -- Cache Coherence
- Man page: man 7 stdatomic (note: atomics are a C11 language feature; see the standard)
- cppreference: <stdatomic.h> -- comprehensive reference, C view
- Preshing on Programming: Acquire and Release Semantics