Atomics and Memory Ordering at the C Level

What This Concept Is

C11 standardized atomics in <stdatomic.h>. An atomic operation is one that either happens entirely or not at all -- no other thread can observe it half-done.

Two orthogonal ideas:

  • Atomicity. The read-modify-write of a shared counter (*p += 1) is not atomic in general: it may execute as a separate load, add, and store. atomic_fetch_add(p, 1) is atomic: the whole load-add-store is one indivisible operation, which the hardware supports via a LOCK-prefixed instruction on x86, an ldxr/stxr retry loop on AArch64, or similar.
  • Memory ordering. Modern CPUs and compilers reorder instructions aggressively for speed. Memory ordering specifies which reorderings are forbidden as observed by other threads.

The orderings, from strongest to weakest:

| Order | What it guarantees |
|---|---|
| memory_order_seq_cst (default) | Global total order; a single consistent history across threads. |
| memory_order_acq_rel | For read-modify-write operations: acquire on the load half, release on the store half. |
| memory_order_acquire | Subsequent memory accesses cannot be reordered before this load. |
| memory_order_release | Prior memory accesses cannot be reordered after this store. |
| memory_order_relaxed | Atomic value only. No ordering guarantees relative to other memory. |

The default (seq_cst) is the easiest to reason about. The weaker orderings can give measurable speedups, mainly on weakly ordered CPUs such as ARM and PowerPC, but they require careful reasoning about which invariant each ordering preserves.
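
To make the default concrete, here is a minimal sketch of the same increment written both ways. The helper names bump_default and bump_relaxed are made up for illustration; the atomic_* calls are the standard <stdatomic.h> ones.

```c
#include <stdatomic.h>

/* Hypothetical helper: bump a hit counter using the default ordering.
   atomic_fetch_add(p, 1) behaves like the _explicit form called with
   memory_order_seq_cst. Returns the value held before the add. */
long bump_default(atomic_long *hits) {
    return atomic_fetch_add(hits, 1);
}

/* Same operation with the ordering spelled out and weakened to relaxed:
   still atomic, but it orders nothing else around it. */
long bump_relaxed(atomic_long *hits) {
    return atomic_fetch_add_explicit(hits, 1, memory_order_relaxed);
}
```

Both helpers return the counter's previous value, so calling bump_default and then bump_relaxed on a counter that starts at 0 returns 0 and then 1, and leaves the counter at 2.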

Why It Matters Here

Atomics let you build lock-free data structures (queues, counters, flags) and they are the mechanism underneath mutexes and condition variables themselves. Equally important: they are often not what you need. A while (!done) { ... } loop where done is not atomic can spin forever even after another thread writes done = 1, because the compiler is allowed to hoist the read out of the loop. The fix is atomic_bool done, not "just a volatile" -- volatile is for memory-mapped hardware registers, not threads.
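
The stop-flag case can be sketched as follows. This is an illustrative outline, not a complete program: the names done, spin_until_done, and run_stop_demo are assumptions, and error handling is kept to a minimum.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared stop flag. Because it is atomic, every loop iteration below
   performs a real load instead of reusing a value read once. */
atomic_bool done = false;

/* Worker: keep going until another thread sets done. */
static void *spin_until_done(void *arg) {
    (void)arg;
    while (!atomic_load(&done)) {
        /* real work would go here */
    }
    return NULL;
}

/* Start the worker, request a stop, and wait for it to exit.
   Returns 0 on success, nonzero if a pthread call failed. */
int run_stop_demo(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, spin_until_done, NULL) != 0)
        return 1;
    atomic_store(&done, true);   /* seq_cst by default */
    return pthread_join(t, NULL);
}
```

With a plain bool in place of atomic_bool, the same loop would be a data race, and an optimizing compiler would be free to read done once and never again.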

Concrete Example

A shared counter, two ways:

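#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_long counter_atomic = 0;
long counter_plain = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER; /* unused here; for the mutex variant in the drill */

/* Atomic increment: relaxed is enough because only the final count matters. */
static void *worker_atomic(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add_explicit(&counter_atomic, 1, memory_order_relaxed);
    return NULL;
}

/* Plain increment: a data race when two threads run it concurrently. */
static void *worker_plain(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) counter_plain++;
    return NULL;
}

/* Start two threads running fn and wait for both to finish. */
static void run_pair(void *(*fn)(void *)) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, fn, NULL);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
}

int main(void) {
    run_pair(worker_atomic);
    run_pair(worker_plain);
    printf("atomic: %ld  plain: %ld\n", atomic_load(&counter_atomic), counter_plain);
    return 0;
}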

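Two threads running worker_atomic end up with counter_atomic == 2000000 every time. Two threads running worker_plain usually end up with an unpredictable total below 2000000, because counter_plain++ is a separate load, add, and store, and the two threads' steps interleave and overwrite each other's updates. Formally this is a data race, which is undefined behavior in C11, so the observed result can also change with the compiler and optimization level.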

A release/acquire handshake -- the pattern that replaces a mutex for a single flag:

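#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int ready = 0;
int payload;

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                           /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release); /* publish it */
    return NULL;
}

int main(void) { /* main plays the consumer */
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {}
    printf("%d\n", payload); /* guaranteed to be 42 */
    return pthread_join(t, NULL);
}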

The release store guarantees that the write payload = 42 is visible to any thread whose acquire load observes ready == 1. If either side used relaxed ordering instead, the consumer could see ready == 1 and still read a stale payload.

Common Confusion / Misconception

"Atomics replace mutexes." They replace mutexes for simple shared state: counters, flags, single-pointer handoffs. As soon as you need to update two fields consistently (e.g., head and size of a queue), a mutex is almost always clearer and faster than any lock-free design you will write without specialized training.

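The head-and-size case mentioned above can be sketched with a mutex as follows. This is a minimal illustration under assumed names (struct ring, RING_CAP, ring_init, ring_push, ring_pop), not a production queue; return codes from the pthread calls are ignored for brevity.

```c
#include <pthread.h>
#include <stddef.h>

#define RING_CAP 8  /* illustrative capacity */

/* Hypothetical ring buffer: head and size must always change together. */
struct ring {
    pthread_mutex_t lock;
    size_t head;            /* index of the oldest element */
    size_t size;            /* number of stored elements */
    int items[RING_CAP];
};

void ring_init(struct ring *r) {
    pthread_mutex_init(&r->lock, NULL);
    r->head = 0;
    r->size = 0;
}

/* Append v; returns 1 on success, 0 if the ring is full. */
int ring_push(struct ring *r, int v) {
    int ok = 0;
    pthread_mutex_lock(&r->lock);
    if (r->size < RING_CAP) {
        r->items[(r->head + r->size) % RING_CAP] = v;
        r->size++;
        ok = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* Remove the oldest element into *out; returns 1 on success, 0 if empty.
   head and size are updated inside one critical section, so no thread
   can observe one changed without the other. */
int ring_pop(struct ring *r, int *out) {
    int ok = 0;
    pthread_mutex_lock(&r->lock);
    if (r->size > 0) {
        *out = r->items[r->head];
        r->head = (r->head + 1) % RING_CAP;
        r->size--;
        ok = 1;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}
```

Making head and size individually atomic would not help here: another thread could still observe the new head alongside the old size between the two updates.
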
"volatile makes my variable thread-safe." It does not. volatile tells the compiler not to cache the value in a register across reads or reorder accesses with other volatile accesses. It says nothing about ordering relative to non-volatile memory, and it does not emit the memory fences required on multicore CPUs. Use _Atomic/std::atomic for thread communication.

"memory_order_relaxed is faster so I should use it." Only if you understand you are giving up all ordering relative to other memory. A relaxed store of a done flag can be visible to another thread before the payload the flag was supposed to signal. The payload read can then tear or return garbage. Default to seq_cst; weaken only with a specific reason and a microbenchmark.

How To Use It

Practical guidance:

  1. Default to mutexes for anything non-trivial. They are cheap on uncontended paths and correct on complex invariants.
  2. Use atomics for (a) counters, (b) single-writer/many-reader flags, (c) single-producer/single-consumer handoffs.
  3. Leave the ordering at seq_cst unless a profiler shows contention and you can name the invariant the weaker order preserves.
  4. Never use volatile for thread communication.
  5. Prefer atomic_flag with atomic_flag_test_and_set/atomic_flag_clear for the absolute minimum primitive (a spinlock).
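
As a rough sketch of item 5, a spinlock built on atomic_flag might look like the following. The names spin_lock, spin_unlock, add_many, and spin_demo are invented for this example, the default seq_cst ordering is kept in line with item 3, and pthread_create errors are not checked.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;
static long guarded_total = 0;   /* plain variable, protected by spin */

static void spin_lock(void) {
    /* test_and_set returns the previous value: true means another thread
       holds the lock, so keep trying until we are the one who set it. */
    while (atomic_flag_test_and_set(&spin)) {}
}

static void spin_unlock(void) {
    atomic_flag_clear(&spin);
}

static void *add_many(void *arg) {
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        spin_lock();
        guarded_total++;         /* safe: only the lock holder gets here */
        spin_unlock();
    }
    return NULL;
}

/* Run up to 16 workers that each add n; returns the final total. */
long spin_demo(int threads, long n) {
    pthread_t t[16];
    if (threads > 16) threads = 16;
    guarded_total = 0;
    for (int i = 0; i < threads; i++) pthread_create(&t[i], NULL, add_many, &n);
    for (int i = 0; i < threads; i++) pthread_join(t[i], NULL);
    return guarded_total;
}
```

A busy-waiting lock like this burns CPU while it waits, so it only makes sense for very short critical sections; for anything longer, item 1 still applies.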

Check Yourself

  1. Why is counter++ in two threads sometimes wrong even when counter is volatile?
  2. What does memory_order_release on a store guarantee about earlier writes?
  3. In which situations is "just use a mutex" the right answer?

Mini Drill or Application

Do all four:

  1. Compile both worker functions above. Run them 20 times each. Record the distribution of counter_plain.
  2. Add a mutex-protected version of the plain worker. Compare runtimes vs the atomic version at high contention (e.g., 8 threads × 1,000,000 iterations).
  3. Implement the ready/payload handshake and measure the latency in nanoseconds with clock_gettime(CLOCK_MONOTONIC, ...).
  4. In one sentence, explain why memory_order_relaxed on a counter is safe but on a handoff flag is not.
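
For the timing parts of the drill, a small helper along these lines may be useful. It is only a starting point: elapsed_ns and time_increments are hypothetical names, and the feature-test macro is an assumption about building on a POSIX system with a strict -std=c11 flag.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime under strict -std=c11 */
#endif
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
int64_t elapsed_ns(struct timespec start, struct timespec end) {
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

/* Time iters relaxed increments on the calling thread; returns total ns. */
int64_t time_increments(atomic_long *counter, long iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1);
}
```

Dividing the returned total by iters gives an average cost per operation; a single short run is noisy, so repeat the measurement and compare distributions rather than single numbers.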

Read This Only If Stuck