Skip to main content

Probability Starts with a Well-Defined Model

What This Concept Is

Probability does not begin with formulas. It begins with a model of uncertainty. Before any number can mean anything, you have to decide what the random experiment is, what its possible outcomes are, and how those outcomes get weighted.

A probability model has three parts (the Kolmogorov triple):

  • a sample space ( S ) of all possible outcomes of a single trial
  • a collection of events, which are subsets of ( S )
  • a probability measure ( P ) assigning numbers to events such that ( P(\emptyset) = 0 ), ( P(S) = 1 ), and ( P!\left(\bigcup_i A_i\right) = \sum_i P(A_i) ) for disjoint events

The discipline that matters most in practice is choosing what counts as one outcome. Change that choice and every computed probability changes. Blitzstein calls this Pebble World: think of the sample space as a table of pebbles, and probabilities as masses attached to pebbles. A well-defined model is a well-labeled set of pebbles with a consistent mass.

The subtlety is that the same experiment admits multiple valid sample spaces. What makes a sample space good is not a rule in the formal definition; it is the pragmatic requirement that the outcomes you picked are at a granularity where you can actually justify the weights you assign. If the pebbles are too coarse, symmetry becomes a lie. If they are too fine, counting becomes unnecessarily hard.

The three axioms above are called the Kolmogorov axioms and are the smallest consistent set of rules that any probability assignment must obey. From them every identity in this module follows: the complement rule ( P(A^c) = 1 - P(A) ), the addition rule ( P(A \cup B) = P(A) + P(B) - P(A \cap B) ), monotonicity (( A \subseteq B \Rightarrow P(A) \le P(B) )), and the finite-union bound. Nothing needs to be memorized beyond these three axioms -- everything else is derived.

For finite sample spaces the axioms reduce to the naive rule ( P(A) = |A|/|S| ) when weights are uniform. For countably infinite sample spaces (e.g., the number of tosses until the first head), probabilities are assigned via an infinite sum that must still total 1. For uncountable spaces (e.g., real-valued latencies), the weights are given by a density rather than point masses, and probability is computed by integration -- this is the continuous case in Cluster 5. All three cases obey the same axioms; the mechanical formulas differ, but the modeling discipline does not.

Three vocabulary items that will recur in every later cluster:

  • Outcome ( \omega \in S ): one fully-specified result of the experiment.
  • Event ( A \subseteq S ): a set of outcomes you care about.
  • Probability measure ( P ): the function that assigns a number in ([0,1]) to each event.

Mastering these three as distinct objects -- not as interchangeable synonyms -- is the deepest prerequisite for this module.

Why It Matters Here

Most probability errors in computer science are not algebra mistakes. They are modeling mistakes: the sample space was wrong, or the granularity was wrong, or information that had already arrived was never folded in. Later tools -- Bayes, expectation, the CLT -- all assume the event model is stable. If the model shifts silently, every downstream calculation is contaminated.

This module feeds every later probabilistic argument in the degree. In Semester 2, average-case analysis of quicksort and the correctness argument for universal hashing both begin "fix a probability model over the input." In Semester 5, queueing theory starts by writing down an M/M/1 state space before computing ( \rho = \lambda / \mu ). In Semester 6, reasoning about replication lag and partition failures begins with an explicit probability space over component-failure outcomes. In Semester 8, SLO math ("99.9% of requests under 200 ms") only makes sense once you have said what "a request" is and how requests are sampled. In Semester 9, canary analysis tests a hypothesis about one sampling frame against another.

Operationally, the hardest part of any new probabilistic question is rarely the algebra -- it is deciding whether the situation really does match a well-defined probability model, and if so, which one. A model mismatch at this layer silently contaminates every later number. Every textbook formula you will learn in the remainder of the module assumes a correct model; none of them verify it for you. The skill this concept trains is the one no formula can replace.

The habit the module is trying to build is cheap and permanent: before writing any formula, write down the sample space in one line.

A concrete engineering payoff. In incident reviews, it is common to see a sentence like "the failure rate tripled last week." That sentence is only interpretable if the denominator is pinned down: failures per deploy, per request, per user, per calendar day? The probability model -- sample space plus weighting -- is the denominator discipline. Teams that name the model up front avoid arguments that are really about sample-space disagreement disguised as arguments about numbers.

Concrete Examples

Example 1 -- Two fair coin tosses. A correct sample space is [ S = {HH, HT, TH, TT}. ] The event "exactly one Head" is ( A = {HT, TH} ), so [ P(A) = \frac{|A|}{|S|} = \frac{2}{4} = \frac{1}{2}. ] Why this works: each outcome represents one trial at the same level of detail, and they are equally likely. If instead you tried to use ( {0 \text{H}, 1 \text{H}, 2 \text{H}} ) as the sample space and treat them as equally likely, you would get ( 1/3 ), which is wrong. The coarse sample space is legal -- it is still a sample space -- but it is not one on which uniform weights are justified.

Example 2 -- Load balancer hash model. Three distinct requests are hashed independently, uniformly, into two buckets. Model the sample space as the ( 2^3 = 8 ) assignment tuples [ S = {(b_1, b_2, b_3) : b_i \in {0,1}}. ] The event "all three land in the same bucket" has size ( 2 ), so ( P = 2/8 = 1/4 ). Notice the modeling decision: labeled requests, not counts by bucket. Counts would give 4 outcomes (3-0, 2-1, 1-2, 0-3) that are not equally likely. The tuple model is the one that supports symmetry.

Example 3 -- SRE incident-day model. "What is the probability that today is a SEV-1 day?" forces you to pick a sample space before the number is meaningful. Is the trial "one calendar day," "one on-call rotation," or "one deploy-event"? Each choice gives a different probability for the same English sentence. Most production-incident postmortems implicitly pick one and then compute; getting the denominator wrong is why "our incident rate dropped 50%" reports sometimes reflect only a switch from days to deploys as the sampling unit. With 90 days, 180 on-call shifts, and 2{,}400 deploys in a quarter, three SEV-1s translate to rates of ( 3/90 \approx 3.3% ), ( 3/180 \approx 1.7% ), or ( 3/2400 \approx 0.125% ) -- three different answers to the same question, distinguished entirely by the sample space.

Common Confusion / Misconceptions

"Outcome and event are the same thing." No. An outcome is a single pebble; an event is a set of pebbles. ( HT ) and ( TH ) are different outcomes of a two-toss experiment, and both belong to the same event "exactly one Head."

"Any sample space works as long as it covers all cases." Formally yes, operationally no. You need a sample space where your probability weights can be justified. A summary-level space is often wrong for the naive ( |A|/|S| ) rule.

"Symmetry is automatic." Equally-likely reasoning is only legal if the process of the experiment treats outcomes symmetrically. A spinning biased coin, a hash family that is not uniform, or a die with a chipped corner breaks symmetry even though the sample space looks clean.

"The model can be updated mid-problem without re-deriving." If you change the sample space (e.g., move from labeled requests to bucket counts), you must re-derive every probability from scratch. Silently swapping representations is a major source of off-by-Stirling-approximation errors in textbook solutions and of real bugs in simulation code.

"The probability number is the answer." Not until you have said what event it refers to in the model. A number without a stated sample space is not a probability; it is a digit.

"Probabilities must come from data." Not at this level. At the modeling layer, probabilities are assigned by a combination of symmetry, physical mechanism, or explicit assumption. Empirical estimation (Cluster 5, concept 15) is a separate step that consumes a model rather than producing one. Conflating the two is why sentences like "the probability of a 7 on two dice" feel controversial -- they are not, once the symmetry assumption is made explicit.

How To Use It

Before computing any probability, work through this checklist:

  1. State the experiment. What is one trial? One coin flip? One request? One hand of cards?
  2. List the atomic outcomes. At what level of detail? Ordered or unordered? Labeled or counted?
  3. Justify the weights. Are outcomes equally likely? If not, what rule assigns probabilities to them?
  4. Define the target event as a subset. Write it as a set, not a sentence.
  5. Re-read the problem to check the scope. Did the problem provide any conditioning information that should shrink ( S )?
  6. Compute. Only now apply formulas.
  7. Sanity-check. Is the answer between 0 and 1? Does a trivial variant give a known answer?
  8. Document the model. If you are writing a design doc, an incident review, or a capacity-planning memo, put the sample space and weighting assumptions in one sentence at the top. Readers should never have to reverse-engineer "what was a trial?" from the math.

If you cannot complete step 1-4 in words before opening a formula sheet, you are not ready to compute.

Transfer / Where This Shows Up Later

  • Semester 2 (Algorithms). Randomized quicksort is analyzed on the sample space of all pivot sequences, uniform over permutations of the input. Universal hashing is a statement about a sample space of hash functions, not of inputs.
  • Semester 5 (Systems). Queueing models begin with an explicit state space: "the number of jobs in the system" with transition rates. Every steady-state formula is a probability over that state space.
  • Semester 6 (Distributed). A replication protocol's correctness proof often states "given the failure model ( \mathcal{F} ), the probability of inconsistency is bounded by …" -- the failure model is the sample space.
  • Semester 8 (SRE / SLOs). Latency SLOs ("p99 < 200 ms") are percentiles of a specified population of requests. If the population drifts (time-of-day mix, cohort, authenticated vs anonymous), the percentile drifts with it. The sample space is an operational artifact.
  • Semester 9 (Deployments). A canary test compares two samples drawn from two populations (control and canary). The analysis is only valid when the two sample spaces are comparable -- same endpoints, same user cohort, same window.
  • Semester 2 (Randomized algorithms). A Las Vegas algorithm's sample space is the set of random seeds the algorithm consumes; correctness holds for every seed, and the running-time analysis is a probability over that seed space, not over the input.
  • Semester 6 (Databases). Isolation-level analysis is a statement about a sample space of interleaved transaction schedules. "Serializability" is the event, the set of schedules equivalent to some serial order; picking the wrong schedule space is how ad-hoc isolation arguments go wrong.
  • Semester 8 (Reliability / SLOs). Availability ("three nines," 99.9%) is a probability over a stated population of requests. If the population excludes internal health-checks but includes user-facing reads, the number changes; the model definition is the SLO's hidden contract.
  • Semester 9 (Observability). Sampled traces (e.g., 1% of requests) only produce valid aggregate statistics if the sampling sample space is known and well-behaved (uniform, stratified, or head-based). Under-documented sampling is a frequent source of phantom regressions in dashboards.

Check Yourself

  1. Give two different sample spaces for "roll two dice and report the sum." Which one supports naive equally-likely reasoning? Why?
  2. Why can a coarse sample space give a wrong probability even though it covers every case?
  3. A probability distribution is sometimes called a measure. What must be true about it when applied to disjoint events? Why is that property necessary?
  4. You observe a request and ask "was it from a new user?" What is the sample space? What changes if you condition on "the user was authenticated"?
  5. When is it legitimate to divide by ( |S| )? Give an example where it is not.
  6. A hash function maps strings into 16 buckets. What sample space underlies the claim "the probability of collision between two specific strings is ( 1/16 )"? What assumption on the hash family is doing the work?
  7. Why is it often useful to keep the sample space more fine-grained than strictly needed (e.g., labeled requests instead of counts)? What does the extra granularity buy you when the question later changes?

Mini Drill or Application

Three standard "gotchas" to notice while you work:

  • The phrase "at least one" is a trigger word -- almost always a cue to use the complement event (Concept 3).
  • "Without replacement" quietly removes independence -- the sample space shrinks trial-by-trial (Concept 6).
  • "Uniformly at random" is a claim about the process, not a synonym for "somehow chosen."

For each situation, write (a) one reasonable sample space, (b) the event of interest as a subset, (c) whether equally-likely reasoning is justified, and (d) the probability.

  1. Rolling two dice and asking for a sum of ( 7 ).
  2. Selecting one student uniformly from a class of 30 and asking whether they know Python, given that 12 do.
  3. Drawing two cards without replacement and asking for at least one ace. State the sample space in both the ordered and unordered representations and compute the probability in each. You should get the same answer.
  4. A fair coin is flipped until the first Head. What is the sample space? Is it equally likely?
  5. Requests arrive at a service at random times in ([0, 60]) seconds. State a sample space for "the arrival time of the next request" and explain why equally-likely reasoning is justified under a uniform-arrival assumption.

Simulation drill (Monte Carlo). In Python or a notebook, simulate 100,000 pairs of dice rolls and compute the empirical probability of sum ( = 7 ). Compare to ( 6/36 ). Repeat with only ( 1{,}000 ) trials and note the variability. This is your first encounter with the theme of Cluster 5: averages stabilize as ( n ) grows.

Engineering scenario. Your team is launching a new endpoint and asks "what fraction of requests will fail?" Before you produce a number, force the conversation through step 1 of the checklist: what is one trial? Is it one HTTP request (including retries)? A single TCP attempt? A user-perceived outcome (which may span retries and fall-throughs)? Writing the sample space in one sentence makes the disagreement visible and usually resolves the argument before any math is done. This is the minimum-viable probability-modeling habit; it pays back every week of on-call work.

Second engineering scenario -- hashing design review. A colleague proposes a custom hash-to-bucket scheme and claims "each request lands in a uniformly random bucket." Before accepting the claim, ask for the probability space: is the randomness over the hash function (fixed per deploy, random across deploys) or over the inputs (fixed hash, random inputs from a distribution)? The two models look alike in simple symmetric cases and diverge sharply in adversarial ones -- an attacker choosing inputs can break the "random inputs" model but not the "random hash" model. Naming which space the claim is over is exactly what distinguishes cryptographically-safe universal hashing from brittle ad-hoc hashing.

Read This Only If Stuck