The Hardware Foundation: Atomic Instructions and Cache Coherence
What This Concept Is
Software mutual exclusion is impossible without some lower-level primitive that is indivisible. The CPU provides a small set of such primitives:
- Test-and-Set (TAS / atomic exchange). Atomically write a new value and return the old value.
- Compare-and-Swap (CAS /
cmpxchgon x86). Atomically compare a memory cell to an expected value and, if equal, replace it with a new value; in either case, report the old value. - Load-Linked / Store-Conditional (LL/SC on ARM, RISC-V, POWER). Load a value and mark the address; a later store-conditional succeeds only if no other write to that address happened in between.
- Fetch-and-Add (
xaddon x86). Atomically add to memory and return the previous value.
These instructions work against a cache coherence protocol (MESI and its variants on x86, MOESI on AMD, MESIF on some Intel) that ensures every core eventually sees a single logical value for each cache line. Coherence guarantees that two cores cannot each believe they own the same cache line for writing at the same moment.
Why It Matters Here
Every lock you use in user space ultimately delegates to one of these hardware primitives. Understanding them lets you:
- read the
pthread_mutex_lockimplementation and believe it - reason about why a CAS loop is fast under low contention and slow under high contention
- understand why a cache line that ping-pongs between cores destroys performance even when code "works"
- implement lock-free data structures from first principles
Cache coherence is also what makes atomic variables visible. A write that never leaves a core's store buffer has no defined meaning for anyone else. Atomic instructions carry ordering and visibility guarantees that plain loads and stores do not.
Concrete Example
A minimal spinlock on x86 (pseudo-assembly):
acquire:
mov eax, 1
xchg eax, [lock] ; atomic exchange: eax <- old, [lock] <- 1
test eax, eax
jnz acquire ; if old was 1, lock was held; retry
ret
release:
mov [lock], 0
ret
The xchg instruction is the TAS primitive. It takes the cache line exclusively, writes 1, and returns whatever was there. Two cores racing for the lock will serialize at the coherence protocol: exactly one xchg sees an old value of 0 and wins.
Without xchg, a naive sequence load; cmp; store would race: two cores could each load 0, each decide the lock is free, and each store 1.
Common Confusion / Misconception
"volatile makes a variable atomic." No. In C and C++ (not Java), volatile only tells the compiler not to cache the variable in a register. It does not prevent the CPU from reordering, does not guarantee atomic access to a word-sized object, and does not force cache-line visibility. Use _Atomic / std::atomic instead.
Another misconception: "Cache coherence means my writes are instantly visible." Coherence says all cores agree on a total order for writes to a single cache line. It does not say writes are instant, and it does not say writes to different cache lines appear in the same order everywhere without additional barriers.
How To Use It
When you reach for a synchronization mechanism, ask:
- What is the hardware primitive under it (TAS, CAS, LL/SC, fetch-and-add)?
- Does my use require mutual exclusion only, or also visibility and ordering?
- Will the cache line be hot (written by many cores) or cold (read-mostly)? Hot cache lines bottleneck at coherence.
If a fast path avoids the atomic, like a double-checked lock, you must add explicit memory ordering (acquire/release) to compensate.
Check Yourself
- Why can't a software-only algorithm provide mutual exclusion on modern hardware without special instructions?
- What does
xchggive you that a plainload; storepair does not? - Why is a spinlock on a hot cache line a scalability disaster even when it is "correct"?
Mini Drill or Application
Write a 10-line pseudo-assembly trace of two cores both executing xchg eax, [lock] with [lock] = 0. Show the cache-line ownership transitions (Invalid / Shared / Exclusive / Modified). Identify which xchg wins and why.
Why Memory Models Enter The Picture
Even with coherence, the CPU and compiler can reorder independent loads and stores. x86 provides a relatively strong model (TSO: total store order, with store-buffer-induced store-load reordering); ARM and POWER are much weaker. The hardware primitives above carry implicit barriers on most ISAs (for example, xchg on x86 is a full fence), but plain atomics can still be reordered against surrounding non-atomic accesses without explicit ordering annotations. That is why portable code must use memory_order_acquire on lock acquisition and memory_order_release on lock release; on x86 those become no-ops, on ARM they emit the required dmb / ldar / stlr instructions.
False Sharing: A Performance Trap That Looks Like A Correctness Problem
Cache coherence operates at the granularity of a cache line, typically 64 bytes. If two unrelated atomic variables happen to sit on the same line, every write to one invalidates the line on every other core, even though the other cores only care about the other variable. The result is a lock-free algorithm that "works" but runs 10x slower than expected. The fix is to pad each hot variable to its own cache line (alignas(64) in C++, __attribute__((aligned(64))) in GCC). Profiling tools like perf c2c on Linux surface false-sharing hotspots directly. Remember this pattern: correctness and performance both depend on understanding the cache line as the true unit of coherence.
A Mental Model: The Coherence Bus
Picture a single hardware "coherence bus" that all cores talk to. When core A wants exclusive write access to a cache line, it broadcasts an invalidation; every other core with that line in its cache must discard it and reply. Only then does A get write ownership. Each atomic write to a contended cache line costs one full round trip on that bus -- hundreds of cycles, thousands if the line is remote on a NUMA system. This is why a naive shared counter scales negatively past a few cores: every increment is a bus transaction. It is also why redesigning to per-core counters that are summed lazily (sharding) is the single most reliable scalability trick in systems programming.
Atomic Word Sizes and Natural Alignment
Every architecture guarantees atomic reads and writes for a few specific sizes at specific alignments: a 4-byte int aligned on 4 bytes is atomic on x86 and ARM; an 8-byte long aligned on 8 bytes is atomic on 64-bit CPUs but can tear on 32-bit CPUs unless the compiler emits special instructions. A struct larger than a word is never atomically readable or writable by plain loads and stores. This is why std::atomic<T> for a large T silently falls back to a lock-backed implementation on many platforms; check with std::atomic<T>::is_always_lock_free. When in doubt, keep your atomically-shared state to a single naturally-aligned word, and use a std::atomic<std::uint64_t> with bit-packed fields if you need multiple related values.
When The Hardware Guarantee Is Not Enough
Even when each individual access is atomic, you may still need ordering barriers to enforce happens-before relationships across different cache lines. A classic example is double-checked locking: the write that publishes a new object and the write that sets the "initialized" flag live in different cache lines, and without memory_order_release on the publish and memory_order_acquire on the read, a reader can see the flag set before the object fields are visible. The hardware made each store atomic, but the ordering between stores was not preserved. This is why modern synchronization APIs always specify both the atomicity and the ordering separately -- ignore either one and correctness leaks.