Core Discrete Distribution Families

What This Concept Is

Distribution families are reusable models for recurring random processes. Rather than derive a PMF from scratch for each new problem, you learn a small set of named families, memorize their stories, and recognize which one fits the situation. The five that cover almost all engineering use cases at this level:

Bernoulli(( p )) -- a single success/failure trial. ( P(X = 1) = p ), ( P(X = 0) = 1 - p ). ( E[X] = p ), ( \mathrm{Var}(X) = p(1-p) ).
Binomial(( n, p )) -- number of successes in ( n ) independent Bernoulli(( p )) trials. ( P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} ). ( E[X] = np ), ( \mathrm{Var}(X) = np(1-p) ).
Hypergeometric(( N, K, n )) -- number of successes in ( n ) draws from a population of size ( N ) with ( K ) success items, without replacement. ( P(X = k) = \binom{K}{k}\binom{N-K}{n-k}/\binom{N}{n} ). ( E[X] = n K/N ).
Geometric(( p )) -- number of trials until (and including) the first success, with per-trial success probability ( p ). ( P(X = k) = (1-p)^{k-1} p ) for ( k = 1, 2, \dots ). ( E[X] = 1/p ), ( \mathrm{Var}(X) = (1-p)/p^2 ). (Memoryless!)
Poisson(( \lambda )) -- count of rare events over a fixed window where events occur at rate ( \lambda ). ( P(X = k) = e^{-\lambda} \lambda^k / k! ) for ( k = 0, 1, 2, \dots ). ( E[X] = \mathrm{Var}(X) = \lambda ).

The goal is not memorizing names. The goal is recognizing which story fits which process. Every family has a characteristic story; if the process matches the story, the family is the right model. If it doesn't, a different family or a custom PMF is needed.

The relationships among families are worth internalizing:

Binomial vs Hypergeometric. With replacement -> binomial; without replacement -> hypergeometric. As ( N \to \infty ) with ( K/N = p ) fixed, hypergeometric ( \to ) binomial. The difference matters most when ( n ) is close to ( N ).
Binomial vs Poisson. When ( n ) is large, ( p ) is small, and ( np = \lambda ) is moderate, ( \mathrm{Binomial}(n, p) \approx \mathrm{Poisson}(\lambda) ). This is the "law of rare events."
Geometric vs Negative Binomial. Geometric is a special case of negative binomial where you wait for just 1 success.

Why It Matters Here

These families appear everywhere in CS:

Bernoulli / Binomial: packet-loss counts, test-defect counts, coin-flip style A/B results, failure counts in a request batch.
Hypergeometric: sampling without replacement from a finite user base (audit samples, acceptance sampling).
Geometric: retries until success, login-attempt counts, first-collision waiting times, time-to-recovery.
Poisson: arrivals per unit time in queueing systems, rare-event counts (SEV-1 incidents per quarter), typo counts per page, radioactive decay counts.

Choosing the wrong family usually reveals a misunderstanding about independence, replacement, or the quantity being measured. An engineer who models login attempts as binomial will get the wrong answer if attempts depend on each other (e.g., rate limiting). An engineer who models rare-event counts as Poisson but in a region where ( \lambda ) varies with time will see bursty behavior the model cannot capture.

Concrete Examples

Example 1 -- Binomial: failed requests. A service handles 100 independent requests, each failing with ( p = 0.02 ). Let ( X ) be the number of failures. The process matches the binomial story exactly: fixed number of trials, two outcomes per trial, constant failure probability, independent trials. So ( X \sim \mathrm{Bin}(100, 0.02) ), with [ E[X] = np = 2,\qquad \mathrm{Var}(X) = np(1-p) = 1.96,\qquad P(X = 0) = 0.98^{100} \approx 0.133. ] If the requests were drawn from a fixed pool with exactly 2 known-bad requests (sampling without replacement), the model would instead be hypergeometric, and the probabilities would differ -- more sharply peaked at 2 because exhausting the bad requests limits how many failures are possible.

Example 2 -- Poisson: arrivals to a queue. A support-ticket queue receives tickets at an average rate of ( \lambda = 8 ) per hour. Under the Poisson model (ticket events independent, rate roughly constant over the hour), the number of tickets in the next hour ( X \sim \mathrm{Poisson}(8) ). Then [ P(X = 10) = \frac{e^{-8} \cdot 8^{10}}{10!} \approx 0.0993, \qquad P(X \ge 12) \approx 0.112, ] and since ( \mathrm{Var}(X) = 8 ), the standard deviation is ( \sqrt{8} \approx 2.83 ). The "( E = \mathrm{Var} )" property is a Poisson signature: if your empirical ( \mathrm{Var}/E ) ratio is near 1, Poisson is plausible; if it is much larger, the data is overdispersed (burstier than Poisson), which often means the rate ( \lambda ) itself is varying. Queueing theory assumes Poisson arrivals for tractability but real web traffic is overdispersed; that gap between model and reality drives Semester 5 and 6 corrections.

Common Confusion / Misconceptions

"Binomial and hypergeometric are interchangeable for large populations." Approximately, yes -- the distinction vanishes when ( N \gg n ). For small populations or when ( n ) is a significant fraction of ( N ), they diverge sharply. "Sampling 50 users without replacement from a pool of 1,000 vs from a pool of 10,000,000" -- the first is meaningfully hypergeometric, the second is effectively binomial.

"Geometric waits for the ( n )-th success." That is the negative binomial. Geometric waits for the first success. Confusing these is the most common distribution-family error.

"Poisson always models arrivals." Only when arrivals are independent and the rate is roughly constant. Bursty web traffic, synchronized batch-job arrivals, and login attempts clustered at 9 AM are all not Poisson. The test: does ( \mathrm{Var}/\mathrm{mean} \approx 1 )? If not, Poisson is wrong.

"If it's a count, it's binomial or Poisson." Not always. Counts with upper bounds (at most ( n ) outcomes) are binomial-like; counts without a natural upper bound but with a rate are Poisson-like; counts from a finite pool without replacement are hypergeometric; counts of trials until a target are geometric/negative binomial. The shape of the question matters.

"Geometric is memoryless, so retries reset probability." Memorylessness means ( P(X > n + k \mid X > n) = P(X > k) ): after ( n ) failures, the distribution of additional waits is the same as at the start. The probability per trial does not change; the unconditional distribution just ignores history.

How To Use It

Ask these four questions in order:

Am I counting successes, waiting for the first one, or counting events in a window? Successes -> Bernoulli / Binomial / Hypergeometric. Waiting -> Geometric / Negative Binomial. Window count of rare events -> Poisson.
Is the number of trials fixed or random? Fixed -> binomial. Random (you stop at a target) -> geometric or negative binomial.
Are trials independent? If not, binomial/geometric are wrong -- use hypergeometric (correlation via without-replacement) or build a custom chain.
Is sampling with replacement, without replacement, or in continuous time? Replacement or i.i.d. -> binomial. Finite pool without replacement -> hypergeometric. Continuous time with rate -> Poisson.

A useful secondary check: compute mean and variance from your data and see if they fit the family's signature. Binomial has ( \mathrm{Var}/\mu = 1 - p ); Poisson has ( \mathrm{Var}/\mu = 1 ); overdispersed counts have ( \mathrm{Var}/\mu > 1 ).

Transfer / Where This Shows Up Later

Semester 2 (Randomized algorithms). The number of comparisons in a randomized quicksort call is bounded via sums of Bernoulli indicators; Chernoff bounds for independent Bernoullis give the high-probability ( O(n \log n) ) guarantee.
Semester 2 (Hashing). The number of items hashed to a fixed bucket out of ( n ) total is approximately binomial ( \mathrm{Bin}(n, 1/m) ), or Poisson( (n/m) ) when ( n, m ) are large -- these are the equations behind "load factor" analysis.
Semester 5 (Queueing). The M/M/1 queue has Poisson arrivals (rate ( \lambda )) and exponential service (rate ( \mu )). Mean queue length is ( \rho / (1-\rho) ) where ( \rho = \lambda/\mu ). Both primitives are discrete-distribution families from this concept.
Semester 6 (Distributed). The probability that ( k ) out of ( n ) replicas fail in a shared-risk domain is modeled as binomial under the (often wrong) independence assumption; correlated failures require hypergeometric-style reasoning over failure-group-pool draws.
Semester 8 (SRE). Per-hour incident counts are classically modeled as Poisson; deviations from Poisson (overdispersion) are diagnosed by the variance-to-mean ratio and often trigger a root-cause investigation ("why are incidents bursty?").
Semester 9 (Canary). Success/failure counts in the canary group are binomial; the sample-size calculation for detecting a given difference in rates uses the binomial mean and variance directly.

Check Yourself

What story distinguishes geometric from binomial? From negative binomial?
Why does sampling without replacement break the binomial model? Why does the binomial approximation improve when ( N \gg n )?
When is Poisson a plausible approximation to binomial? State the conditions.
For Poisson, why does ( E[X] = \mathrm{Var}(X) )? What does it mean empirically if the data shows ( \mathrm{Var}/E \gg 1 )?
Derive the geometric PMF ( P(X = k) = (1-p)^{k-1} p ) from first principles using the event "first ( k-1 ) trials are failures and trial ( k ) is a success."
The memorylessness of geometric is expressed as ( P(X > n+k \mid X > n) = P(X > k) ). Prove this using the geometric PMF.

Mini Drill or Application

Match each scenario to the best model and justify in one sentence:

Number of heads in 20 coin flips.
Number of red cards in 5 cards drawn from a deck without replacement.
Number of login attempts until first success (each attempt independent, success probability ( p )).
Number of support tickets arriving in one hour when each individual user ticket is independent and rare.
Number of defective chips among 50 drawn from a batch of 1,000 containing exactly 30 defective ones.
Number of successful passes in a known sequence of 15 football plays, each play independent with ( p = 0.6 ).

For each, also state ( E[X] ) and ( \mathrm{Var}(X) ) using the family formulas.

Simulation drill -- variance-to-mean diagnostic. In Python, draw 10,000 samples from Poisson(( \lambda = 5 )) and compute the empirical ( \mathrm{Var}/\mathrm{mean} ). You should see ( \approx 1 ). Now simulate an "overdispersed" version: draw 10,000 samples where each sample is itself Poisson(( \Lambda )) with ( \Lambda ) randomly either 2 or 8 (each with probability 1/2). Compute the empirical mean (should be 5) and variance (should be about 5 + ( \mathrm{Var}(\Lambda) = 5 + 9 = 14 )); the variance-to-mean ratio will be around 2.8. This is the diagnostic you use to detect whether real incident or request-arrival data follows a pure Poisson or something more complex.

Read This Only If Stuck

Introduction to Probability: Bernoulli and Binomial
Introduction to Probability: Hypergeometric
Introduction to Probability: Geometric and Negative Binomial (Part 1)
Introduction to Probability: Poisson (Part 1)
Introduction to Probability: Connections between Poisson and binomial
Introduction to Probability: Connections between binomial and hypergeometric
MCS: Random Variable Examples
Wikipedia: Binomial distribution -- cross-reference for the PMF, mean/variance, and the normal and Poisson approximations.
Wikipedia: Hypergeometric distribution -- cross-reference highlighting the without-replacement contrast with binomial.

What This Concept Is​

Why It Matters Here​

Concrete Examples​

Common Confusion / Misconceptions​

How To Use It​

Transfer / Where This Shows Up Later​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​