Skip to main content

Pigeonhole Principle, Collisions, and Extremal Guarantees

What This Concept Is

The pigeonhole principle says: if $N$ objects are placed into $k$ boxes with $N > k$, some box contains at least two objects. The generalized form is sharper: if $N$ objects go into $k$ boxes, some box contains at least $\lceil N/k \rceil$ objects.

This is an existence principle. It does not tell you where the collision occurs, only that a collision is unavoidable. In exchange for giving up constructive witnesses, it gives you rock-solid guarantees that hold no matter how the objects are placed.

There is also a probabilistic cousin -- the birthday problem -- which says that collisions become likely long before they become forced. The pigeonhole guarantee requires $k+1$ objects for forced collision in $k$ boxes; the birthday argument shows a collision among $\sqrt{k}$-ish objects is already more likely than not. Computer science uses both: pigeonhole for worst-case lower bounds, birthday for average-case attack analysis.

The real skill is not the principle itself but identifying the right classification: what are the pigeons and what are the holes? Good pigeonhole arguments usually come from choosing a classification that looks slightly artificial at first, such as remainders modulo $k$, parity classes, interval buckets, or finite-state signatures.

Formal statements.

  • (Basic) If $f: A \to B$ and $|A| > |B|$, then $f$ is not injective.
  • (Generalized) If $f: A \to B$ and $|A| > k \cdot |B|$, then some $b \in B$ has $|f^{-1}(b)| \ge k + 1$. Equivalently, some box gets at least $\lceil |A|/|B| \rceil$ pigeons.
  • (Infinite) If infinitely many objects are classified into finitely many classes, some class contains infinitely many objects.

The last form is standard in real analysis (Bolzano-Weierstrass uses it) and in proving Ramsey-type results.

Why It Matters Here

This is one of the cleanest finite arguments in mathematics. It appears whenever a system has fewer states, remainders, labels, or categories than the number of objects you are trying to assign. In computer science, this logic underlies:

  • hash collisions: any hash function mapping $> 2^b$ inputs to $b$-bit outputs must produce collisions.
  • duplicate keys / state repetition: a DFA on a string longer than its state count revisits a state (pumping lemma).
  • lower-bound arguments: comparison-sort lower bound of $\Omega(n \log n)$ is a pigeonhole on the $n!$ permutations versus the number of comparison-tree leaves.
  • load balancing: some server must handle at least $\lceil N/k \rceil$ of $N$ requests across $k$ servers.
  • compression limits: no lossless compressor can shrink every $n$-bit input (by pigeonhole into strictly fewer bits). The impossibility is unconditional.
  • database uniqueness constraints: if more rows are inserted than distinct values in a column with a UNIQUE constraint, some insert must fail -- the database engine enforces pigeonhole.

The pigeonhole is the background physics of "something has to give" arguments.

Connection to counting and probability. Pigeonhole is the deterministic cousin of the probabilistic union bound and the combinatorial averaging argument. If you can prove an average-value claim (e.g., average degree is $\ge d$), pigeonhole promotes it to an existential claim: some specific object has the property. Conversely, when you want to prove existence without constructing a witness, averaging + pigeonhole is often the cheapest route.

Concrete Examples

Example 1: remainders. Among any $13$ integers, two have the same remainder modulo $12$, so their difference is a multiple of $12$.

Why? There are only $12$ remainder classes: ${0, 1, 2, \ldots, 11}$. Placing $13$ integers into these $12$ classes forces at least one class to contain $\lceil 13/12 \rceil = 2$ integers. Their difference is therefore a multiple of $12$. This is not a probabilistic statement -- it is a certainty. The choice of modulus $12$ is entirely up to you; any classification into finitely many classes allows pigeonhole.

Example 2: hash-function collision guarantee. A hash function $h: {0, 1}^{256} \to {0, 1}^{64}$ maps $256$-bit inputs to $64$-bit outputs. There are $2^{256}$ possible inputs and only $2^{64}$ possible outputs. By generalized pigeonhole, some output value has at least

$$\left\lceil \frac{2^{256}}{2^{64}} \right\rceil = 2^{192}$$

preimages. Collisions are not just possible -- they are enormously frequent in the universe of possible inputs. (The birthday attack then tells you they are also findable after roughly $2^{32}$ random queries in practice, but that is a separate argument.)

This impossibility proof uses no cryptographic assumption: it is structural. No hash function can escape it. Cryptographic hash functions instead aim to make collisions computationally difficult to find, not nonexistent -- the gap between "exists" and "findable" is where the entire field of preimage and collision resistance lives.

Example 3: pumping-lemma flavor. Any DFA with $k$ states that accepts a string of length $\ge k$ must revisit a state. The pigeons are the $k+1$ positions visited along the run; the holes are the $k$ states. Revisiting a state means the middle section can be pumped -- a central lemma for proving non-regularity.

Example 4: Erdős-Ko-Rado flavor. Among any sequence of $n^2 + 1$ distinct real numbers, there is a monotone (strictly increasing or strictly decreasing) subsequence of length $n + 1$ (Erdős-Szekeres, 1935).

Proof sketch. For each position $i$, let $a_i$ = length of the longest increasing subsequence ending at $i$, and $d_i$ = length of the longest decreasing subsequence ending at $i$. If no monotone subsequence of length $n + 1$ exists, then $1 \le a_i, d_i \le n$ for all $i$, giving at most $n^2$ distinct $(a_i, d_i)$ pairs. By pigeonhole, two positions $i < j$ share the same pair. But then either $x_i < x_j$ (contradicting $a_j = a_i$) or $x_i > x_j$ (contradicting $d_j = d_i$). $\blacksquare$

This argument is used in Semester 2 algorithms for patience-sorting and LIS (longest increasing subsequence) analysis.

Common Confusion / Misconceptions

  • Wrong pigeonholes. The hard part is usually not applying the principle but choosing the right classification. If the boxes are the wrong categories, you get nothing. Train the habit of asking: "What equivalence classes are forced by the problem?"
  • Demanding a constructive witness. Pigeonhole arguments prove existence without telling you exactly which objects collide. That is a feature, not a bug: you trade constructive knowledge for unconditional certainty.
  • Confusing pigeonhole with birthday. Pigeonhole says collisions are forced past a threshold; the birthday problem says collisions are likely far below that threshold. Both matter, but they answer different questions.
  • Applying the basic form when generalization is needed. "$N$ into $k$, some box has $\ge 2$" is the weak form; "$N$ into $k$, some box has $\ge \lceil N/k \rceil$" is often what you really want and is strictly stronger.
  • "The principle is trivial, so the proof must be easy." The statement is trivial; the application often requires a nontrivial construction of the right classification. Olympiad problems that succumb to pigeonhole are usually rated as hard specifically because the right partition is non-obvious.

How To Use It

Ask, in order:

  1. What are the objects (the pigeons)?
  2. What are the categories (the holes)?
  3. Why does every object belong to exactly one category?
  4. Why are there strictly more objects than categories (or at least $km + 1$ for an $m+1$-pigeon claim)?
  5. What conclusion about a specific pair (or group) of objects does the collision give?
  6. Is the weak form enough, or do I need the generalized $\lceil N/k \rceil$ form?
  7. Could a birthday-style probabilistic argument give a tighter bound in expectation?

Good pigeonhole arguments come from choosing a slightly unexpected classification: remainders mod some well-chosen $k$, first-differences, subset sums, parity of something computed from the object.

Shortcut patterns.

  • "Among $k+1$ items, two share some attribute from a $k$-valued domain." $\to$ plain pigeonhole.
  • "Consecutive differences out of a bounded range cannot all avoid repetition." $\to$ pigeonhole on partial sums or differences.
  • "A long string processed by a finite state machine must revisit a state." $\to$ DFA pigeonhole, gateway to the pumping lemma.
  • "A function loses information because its codomain is smaller than its domain." $\to$ lower bounds in data structures, cryptography, information theory.

Transfer / Where This Shows Up Later

  • Semester 2 algorithms: comparison-sort lower bound ($\Omega(n \log n)$) is pigeonhole on permutations vs decision-tree leaves. Lower bounds for hashing and for Bloom filters use pigeonhole on bit count vs distinguishable inputs.
  • Semester 3 data structures: any static dictionary on $n$ keys with $< n$ bits per key must lose information -- pigeonhole on codeword space.
  • Semester 4 databases: hash-join collision rates are bounded via birthday/pigeonhole; column-statistics estimators rely on pigeonhole over sampled bucket counts.
  • Semester 5 OS: address-space collisions in randomization schemes (ASLR) are pigeonhole arguments on entropy vs allocation rate. Scheduler fairness bounds use $\lceil N/k \rceil$ reasoning.
  • Semester 6 distributed systems: consistent hashing analyses use birthday/pigeonhole to predict load imbalance; quorum intersection theorems use $\lceil N/2 \rceil + 1$ pigeonhole reasoning to prove majority sets overlap.
  • Cryptography (Semester 7+): birthday bound $2^{n/2}$ on hash collisions dictates the bit-length choices for SHA-256 (resist $2^{128}$-work collision search) versus MD5 (broken at $2^{64}$).
  • Information theory: Kraft's inequality and the source-coding lower bound use pigeonhole on codeword space to prove that no prefix-free code can beat the entropy rate by more than $1$ bit per symbol.
  • Compilers / SSA: register allocation's spilling argument ("if interference graph needs $> k$ colors, some live range spills to memory") is pigeonhole on available registers.

Check Yourself

  1. What determines the pigeonholes in a proof? Is the choice unique?
  2. Why is $\lceil N/k \rceil$ the right lower bound in the generalized form? Prove it by contradiction.
  3. What kind of conclusion does this method usually give: exact count or guaranteed existence?
  4. State the difference between the pigeonhole guarantee and the birthday-problem expectation in one sentence each.
  5. Any function from a $7$-element set to a $4$-element set fails to be injective. Why? Identify the pigeons and holes precisely.
  6. State the infinite pigeonhole principle. Use it to prove that every infinite binary sequence has an infinite constant subsequence.
  7. For a comparison-based sort on $n$ elements, there are $n!$ possible outputs. Why does the sort require at least $\lceil \log_2(n!) \rceil$ comparisons in the worst case? (Classical pigeonhole lower bound.)
  8. Compare the pigeonhole guarantee "$k+1$ items force a collision in $k$ bins" to the birthday expectation "$\Theta(\sqrt{k})$ items expect a collision." Why does the latter not contradict the former?

Mini Drill or Application

For each claim, identify the pigeons and the holes, then justify:

  1. Among any $367$ people, two share a birthday.
  2. Among any $6$ integers from ${1, \ldots, 10}$, two of them sum to $11$. (Hint: pair up ${1,10}, {2,9}, {3,8}, {4,7}, {5,6}$.)
  3. Any sequence of $n^2 + 1$ distinct real numbers contains a monotone subsequence of length $n+1$ (Erdős-Szekeres -- a classic pigeonhole application).
  4. In any group of $6$ people, three know each other or three are mutual strangers. (Ramsey-style.)
  5. Show that any set of $n+1$ numbers chosen from ${1, 2, \ldots, 2n}$ contains two numbers where one divides the other. (Hint: classify each by its largest odd divisor.)
  6. Chinese Remainder-flavored: among any $11$ integers, show some nonempty subset has sum divisible by $11$. (Hint: partial sums modulo $11$.)
  7. Any $n + 1$ points placed in a unit square contain two whose distance is $\le \sqrt{2}/\sqrt{n}$ -- essentially. Partition the square into $n$ cells of side $1/\sqrt{n}$ and apply pigeonhole.
  8. Show that any function $f: {1, \ldots, 2n} \to {1, \ldots, n}$ has a large fiber: some output has at least $2$ preimages. How large does $\lceil 2n/n \rceil$ guarantee?

Read This Only If Stuck