Skip to main content

Radix Sort and Linear in Key Width

What This Concept Is

Radix sort sorts fixed-width keys one digit at a time using a stable sort (usually counting sort) as the inner pass.

Least-significant-digit (LSD) radix sort:

def radix_sort_lsd(A, d, base=10):
for digit in range(d):
A = counting_sort_by_digit(A, digit, base)
return A

If each key has (d) digits in base (b), and counting sort costs (\Theta(n + b)), the total is (\Theta(d,(n + b))). When (b = \Theta(n)) and (d) is a constant -- for example, 32-bit integers sorted as four 8-bit digits -- this is (\Theta(n)).

Radix sort requires a stable inner sort. If you sort from least to most significant digit, stability of the previous pass is what makes the most recent digit dominate correctly. There are two directions:

  • LSD radix (least-significant-digit first) -- straightforward loop, processes fixed-width keys, needs stable inner sort. The right choice for integers and fixed-length strings.
  • MSD radix (most-significant-digit first) -- recursive; partitions by top digit then recurses. Handles variable-length strings, can short-circuit on early mismatches, and is the basis for three-way radix quicksort (Sedgewick) and the burstsort family.

Distinguish radix from its neighbors: counting sort assumes the entire key is a small integer; trie-based sorts use the same digit decomposition but build a tree instead of reshuffling arrays; bucket sort assumes a distribution on keys rather than fixed width. Radix sort is the one to reach for on short fixed-length byte strings.

Why It Matters Here

Radix sort is how real systems sort fixed-width things fast: integers, IP addresses, short strings, floating-point numbers reinterpreted as integers (with one sign-bit flip trick). It is also how you see that "linear time" always has a hidden parameter -- the width of a key.

Reading (\Theta(d n)) as "linear in (n)" is only honest if (d) is treated as a constant. If keys are logarithmic in (n) (which they must be to be distinct), (d = \Theta(\log n / \log b)), and the total work is (\Theta(n \log n / \log b)) -- still competitive with comparison sort, and often faster because each step is cache-friendly arithmetic rather than branch-heavy comparison.

Radix sort also connects directly to later systems work: byte-at-a-time radix sort is the basis for many database ORDER BY implementations on fixed-width columns, and MSD radix is how production key-value stores sort prefixes in SSTables.

Concrete Examples

Example 1 -- LSD base-10 radix on 3-digit integers. Sort [170, 045, 075, 090, 802, 024, 002, 066] digit by digit.

Pass on ones digit (stable counting sort): [170, 090, 802, 002, 024, 045, 075, 066]

Pass on tens digit: [802, 002, 024, 045, 066, 170, 075, 090]

Pass on hundreds digit: [002, 024, 045, 066, 075, 090, 170, 802]

After each pass the array is sorted by the digits seen so far, and stability of the inner sort preserves earlier orderings among equal current-digit keys.

Example 2 -- 32-bit integers via four 8-bit passes. Given (n = 10^8) random 32-bit integers, the choice is (d = 4), (b = 256). Total work (\sim 4 \cdot (n + 256) \approx 4 \times 10^8) operations, each a cache-friendly array increment or indexed write. A well-tuned comparison sort (e.g., std::sort) on the same input does (\sim n \log_2 n \approx 2.7 \times 10^9) comparisons with branch mispredictions and pointer chasing. Measured on modern CPUs, LSD radix beats comparison sort by 2-4x in wall clock; on GPUs the ratio is closer to 10x because the histogram + prefix-scan pattern maps perfectly to SIMD lanes.

To sort signed integers by radix, flip the sign bit (bitwise XOR with 0x80000000) before sorting and flip it back afterward -- this converts the two's-complement total order into an unsigned lex order over bytes. The same trick, plus XOR of the exponent, sorts IEEE-754 floats.

Complexity derivation. (d) digit passes (\times) one counting-sort pass each: (T(n) = d \cdot (n + b)). With (b = 2^w) (so each pass is a (w)-bit digit) and (d = \lceil k / w \rceil) for (k)-bit keys, (T(n) = \lceil k/w \rceil (n + 2^w)). Minimizing over (w) for a given (n) and (k): set (\partial / \partial w = 0) to find (w \approx \log_2 n), giving (T = \Theta(k n / \log n)). For fixed (k) (32-bit keys), this is (\Theta(n / \log n)) digit-work per element, which is why well-tuned radix sorts cross over below (n \log n) comparison sorts in practice.

Common Confusion / Misconceptions

"LSD and MSD are interchangeable." LSD radix processes least-significant digits first and requires a stable inner sort. MSD radix processes most-significant first, recurses on partitions, and is useful for variable-length strings where early mismatches stop the comparison. Mixing them up either breaks correctness (LSD without stability) or gives up the whole advantage (MSD without early termination).

"Radix is linear." Radix is linear only when (d) is bounded. For unbounded-precision integers where (d = \Theta(\log n)), radix sort becomes (\Theta(n \log n)) -- the same asymptotics as comparison sort, though usually with better constants.

"Radix works on any comparable type." It only works when a key has a digit or byte decomposition with a well-defined comparison order on those digits -- either naturally (integers, ASCII strings) or after encoding (locale-aware string collation, IEEE-754 float canonicalization). For arbitrary objects with a custom comparator, radix does not apply.

"Base choice is cosmetic." Base (b) is a major performance knob. Too small, and (d) grows; too large, and (b) dominates the (\Theta(n + b)) per pass and the count array blows out cache. Production code typically uses (b = 256) (one byte per pass) to match word addressing, SIMD shuffle instructions, and L1 cache line boundaries.

How To Use It

Use radix sort when:

  1. keys are fixed-width integers, short strings, or fixed-length byte sequences
  2. (n) is large enough that the linear behavior pays for the constant factors
  3. you need deterministic worst case and stability
  4. comparisons are expensive relative to integer arithmetic (e.g., sorting 128-bit integers, UUIDs)
  5. the workload is cache-sensitive and comparison sort's branch mispredictions hurt

Do not use radix sort for:

  1. general comparable objects without digit decomposition
  2. small (n), where constants dominate (cutover is typically (n \geq 100) for in-memory radix)
  3. wildly variable-length keys unless you switch to MSD radix with early termination
  4. tiny key ranges where a single counting-sort pass is already (\Theta(n))

Tuning playbook:

  1. Choose the base (b) as a power of 2 matching a native word size (256 for byte-level passes is the standard).
  2. Tune so (d \times \text{passes}) is small and each pass fits in L1/L2 cache.
  3. For GPU and wide-SIMD implementations, use the standard "histogram / exclusive-scan / scatter" triple.
  4. On mixed workloads (some passes over random data, some over nearly-sorted), consider MSD-first with a quicksort cutover inside buckets -- the pattern used by Sedgewick's 3-way radix quicksort.

Transfer / Where This Shows Up Later

  • S2 M2 (this module). LSD radix with counting sort as its inner pass is the direct ancestor of every high-throughput sort you will see in concept 07 (external and practical sorting).
  • S2 M5 (advanced structures). Suffix arrays are built with DC3/SA-IS algorithms that repeatedly radix-sort triples of characters; tries and compressed radix trees use the same digit-decomposition idea.
  • S3 (software design). "Process keys digit-by-digit" is the ordering pattern you use whenever you implement multi-key sorting as repeated stable single-key passes.
  • S4 (systems programming). GPU/SIMD sorting libraries (CUB, Thrust, Arrow) ship radix sort as their primary kernel because the data-parallel histogram/prefix-scan pattern maps to hardware.
  • S5 (OS). File systems that sort inodes by hashed prefixes for directory lookup use byte-at-a-time radix order to exploit cache lines.
  • S6 (databases). ORDER BY on fixed-width numeric or fixed-length string columns frequently dispatches to a radix sort; LSM-trees store keys in radix-sorted order within each SSTable; partition-pruning on range-partitioned tables uses the radix prefix.
  • S7 / S8 (architecture, system design). Distributed shuffle phases (MapReduce, Spark) use radix-style partitioning on hash prefixes to bucket records to reducers. Range-partitioning by a fixed prefix is the architectural generalization of MSD radix.

Check Yourself

  1. Why does radix sort require a stable inner sort?
  2. When is (\Theta(d,(n + b))) actually linear in (n)?
  3. Why does MSD radix work for variable-length strings but LSD radix does not directly?
  4. What is the trick for radix-sorting signed integers? IEEE-754 floats?
  5. Why is base (b = 256) such a popular choice in production?
  6. Derive the optimal (b) for given (n) and key width (k) by minimizing (\lceil k/w \rceil (n + 2^w)). Why does the answer depend on (n) as well as (k)?
  7. Why are parallel LSD radix implementations memory-bound rather than compute-bound on modern hardware?

Mini Drill or Application

  1. Run LSD radix on [329, 457, 657, 839, 436, 720, 355] base 10 and write the array after each digit pass.
  2. Argue why sorting 10 million 32-bit integers by four 8-bit radix passes beats quicksort in practice, using concrete instruction counts.
  3. Describe how you would radix-sort 20-character ASCII strings of variable length using MSD.
  4. Implement a byte-at-a-time LSD radix sort and benchmark it against std::sort on random 64-bit integers for (n = 10^7).
  5. Extend your implementation to sort IEEE-754 float64 values by the exponent-XOR trick; verify on NaN-free input.
  6. Production scenario. A telemetry service ingests 5 billion 128-bit UUIDs per day and must emit them sorted for ID-range-based shard assignment. Compare the wall-clock cost of LSD radix (16 byte-passes) versus std::sort on a 32-core machine; account for memory bandwidth (you cannot avoid scanning 80 GB sixteen times) and explain why a two-level (MSD-then-LSD) variant usually wins for this scale.

Read This Only If Stuck