Counting Sort and When Integer Keys Break the Lower Bound

What This Concept Is

Counting sort sorts n integer keys in the range [0, k-1] by counting how many times each value occurs, then writing them back in order.

def counting_sort(A, k):
    count = [0] * k
    for x in A:
        count[x] += 1
    prefix = 0
    for v in range(k):
        c = count[v]
        count[v] = prefix
        prefix += c
    output = [0] * len(A)
    for x in A:
        output[count[x]] = x
        count[x] += 1
    return output

Running time is (\Theta(n + k)). When (k = O(n)), this is (\Theta(n)) -- strictly better than the (\Omega(n \log n)) comparison lower bound. Memory is (\Theta(n + k)) for the count and output arrays.

Counting sort is stable when the final pass walks A left to right and uses count[v] as a placement cursor. Stability is why it serves as the inner pass of radix sort. Three common variants to distinguish:

Classic three-pass counting sort (as above) -- stable, needs (\Theta(n)) output buffer.
In-place counting sort -- writes back into A itself; loses stability but saves memory.
Bucket sort -- a close cousin that uses (m) buckets and a secondary sort inside each; tuned for real-valued keys drawn from a known distribution.

Compare to its model-mates: radix sort uses counting sort as a subroutine on digits (concept 06); bucket sort assumes uniform input and gets expected (\Theta(n)) with high probability but (\Theta(n^2)) worst case. Counting sort, unlike both, is deterministic: (\Theta(n + k)) always, no probability assumption.

Why It Matters Here

Counting sort is the clean example of "break the lower bound by using something other than comparisons." It uses keys as array indices, which comparison sorts cannot do. That one move is the entire trick.

It is also:

the inner engine of radix sort
the building block for bucketing schemes (histograms, counting inversions by value, order-statistic queries over small ranges)
often the fastest thing in practice when the key range is small and known (grades A-F, HTTP status codes, age in years, short bit strings)
the obvious answer to "bin count by category" -- a sort followed by a group-by costs (\Theta(n \log n)), counting sort and its read-out cost (\Theta(n + k))

Concrete Examples

Example 1 -- trace on a small array. Sort (A = [2, 5, 3, 0, 2, 3, 0, 3]) with (k = 6).

Counts: [2, 0, 2, 3, 0, 1]
Prefix sums (starting positions): [0, 2, 2, 4, 7, 7]
Walk (A) left-to-right, placing each (x) at count[x] and incrementing:

-> index 2 ; count becomes [0, 2, 3, 4, 7, 7]
-> index 7 ; count becomes [0, 2, 3, 4, 7, 8]
-> index 4 ; count becomes [0, 2, 3, 5, 7, 8]
-> index 0 ; count becomes [1, 2, 3, 5, 7, 8]
-> index 3 ; count becomes [1, 2, 4, 5, 7, 8]
-> index 5 ; count becomes [1, 2, 4, 6, 7, 8]
-> index 1 ; count becomes [2, 2, 4, 6, 7, 8]
-> index 6 ; count becomes [2, 2, 4, 7, 7, 8]

Output: [0, 0, 2, 2, 3, 3, 3, 5]. Stable, correct, no key comparison happened.

Example 2 -- sorting records by a small key. Sort 10 million log events by HTTP status code (values in {100, 101, ..., 599}, so (k = 500)). Comparison sort: (\sim 10^7 \cdot \log_2 10^7 \approx 2.3 \times 10^8) comparisons, each touching a record in potentially scattered memory. Counting sort: 10 million increments (streaming), 500 prefix sums, 10 million writes in status-code order -- total (\sim 2 \times 10^7) operations, each on sequential memory, plus (4 ,\text{KB}) for the count array. On modern hardware the counting-sort version is measurably 5-10x faster and has no branch mispredictions on the counting pass.

Common Confusion / Misconceptions

"Counting sort is magic." It is not. It works because keys are small integers (so they can index a fixed-size array) and because the algorithm never asks A[i] < A[j]; it asks count[A[i]]. If keys are strings, floats, or objects without a natural index, counting sort does not apply without an encoding.

"Counting sort is always faster than comparison sort." Only when (k = O(n)). If (k = n^2), then (\Theta(n + k) = \Theta(n^2)) and counting sort loses to mergesort. For sparse key ranges, reach for hashing-based bucket sort or stick with comparison sort.

"Stability comes for free." It does not -- you lose it if you walk the array right-to-left, or if you decrement count[v] before placing. The canonical two-pass prefix-sum-then-place is the pattern that preserves stability, and radix sort breaks without it.

"Counting sort is embarrassingly parallel." The counting pass is parallel-friendly (atomic increments or thread-local histograms merged at the end), but the placement pass has a data dependency on the running prefix-sum cursor. Production parallel counting sorts split the input into chunks, compute local prefix sums, and combine with a parallel scan -- this is the canonical "histogram + prefix-scan" pattern in GPU sorting libraries.

How To Use It

Use counting sort when:

keys are integers in a known, bounded range
(k) is (O(n)) or smaller
you need a stable sort as a subroutine (e.g., radix sort)
you are sorting to compute histograms or rank queries
the inner loop must be branch-free (counting increments and prefix adds are excellent here)

Do not use counting sort for:

general-purpose comparable objects without digit or index decomposition
sparse key ranges where (k \gg n)
when you need the output stream, not a materialized array -- counting sort requires a second pass for placement

When designing a pipeline, counting sort is your go-to primitive whenever the input has a small discrete label set -- grades A-F, ages 0-120, short fixed-length bit strings, bucket ids.

Transfer / Where This Shows Up Later

S2 M2 (this module). Counting sort is the stable inner pass of radix sort (concept 06) and the histogram primitive in external sorting (concept 07).
S2 M3 (graphs). Bucketed priority queues (for Dijkstra on integer-weighted graphs) are counting-sort-shaped -- one bucket per distance value.
S3 (software design). "Index by a small discrete tag" is the data-structure choice that turns an (O(n)) filter into (O(1)) lookup; clean code often refactors into a count-array when it sees a switch on an enum.
S4 (systems programming). Cache-friendly histogram kernels in SIMD and GPU code are literal counting sort. The NVIDIA CUB library's radix sort uses counting sort at each digit.
S5 (OS). Process-state histograms for schedulers (how many processes in each priority band) are incrementally maintained counting-sort tables.
S6 (databases). Group-by aggregation over low-cardinality columns (e.g., country, category) uses counting-sort-style histograms; many columnar engines (ClickHouse, DuckDB) specialize this path.
S7 / S8 (architecture, system design). Log-aggregation services bucket requests by status code in (\Theta(n + k)) for dashboards; consistent-hashing virtual-node distributions are counted with a counting-sort histogram before rebalancing.

Check Yourself

Why is counting sort (\Theta(n + k)) rather than (\Theta(n \log n))?
Which step of the algorithm makes it stable, and why does radix sort need that?
When does counting sort fail to be faster than mergesort?
What goes wrong if you do the placement pass right-to-left instead of left-to-right?
Why is the memory cost (\Theta(n + k)) rather than (\Theta(k))?

Mini Drill or Application

Run counting sort on [4, 1, 3, 4, 2, 1, 0, 4, 2] with (k = 5), writing the count and prefix arrays at each step.
Write an in-place variant that handles (n = 10^6) values in the range [0, 1000] without allocating a separate output array. What do you lose?
Explain in one sentence why counting sort does not contradict the comparison-sort lower bound.
Implement counting sort on records keyed by an integer status_code, carrying a payload of arbitrary size. Verify stability on a small example with duplicates.
Parallelize the counting pass using thread-local histograms and a final reduction; measure speedup on 8 cores with (n = 10^8) and (k = 256).
Production scenario. Your observability service ingests 50M log lines per minute and must emit per-endpoint histograms by HTTP status code (values in [100, 599]). Sketch the counting-sort-shaped pipeline, argue why (\Theta(n + k)) beats sort \| uniq -c for this load, and compute the memory footprint at 1M endpoints × 500 status values. Where does the design break if endpoints are attacker-supplied and the cardinality explodes?

Read This Only If Stuck

CLRS: 8.2 Counting Sort
CLRS: 8.4 Bucket Sort -- bucket sort as counting sort's distribution cousin.
CLRS: 8.4 Bucket Sort (analysis)
Skiena: 4.7 Distribution sort via bucketing
Skiena: 4.1 Applications of Sorting
Competitive Programming: 9.32 Sorting in Linear Time
Princeton Sedgewick: 5.1 Key-Indexed Counting -- the canonical stable key-indexed counting implementation.
MIT 6.006 Lecture 5: Linear-Time Sorting -- counting sort framed against the comparison lower bound.
VisuAlgo: Sorting (scroll to counting sort) -- step visualization of the prefix-sum placement pass.
cp-algorithms: Linear-time sorting -- contest-grade counting sort and its use inside radix sort.
Rosetta Code: Counting sort -- side-by-side reference implementations in 50+ languages for spotting corner cases.

What This Concept Is​

Why It Matters Here​

Concrete Examples​

Common Confusion / Misconceptions​

How To Use It​

Transfer / Where This Shows Up Later​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​