PMFs, CDFs, and Functions of Random Variables
What This Concept Is
Once a random variable is defined, its distribution tells you how probability is spread across its values. For a discrete random variable ( X ), the two standard representations are:
- Probability Mass Function (PMF): ( p_X(x) = P(X = x) ). Probability concentrated at each specific value.
- Cumulative Distribution Function (CDF): ( F_X(x) = P(X \le x) ). Probability accumulated up to and including a threshold.
The PMF and CDF carry the same information but answer different questions. The PMF is best for exact-value probabilities, the shape of the distribution near a point, and for computing expectations (Cluster 4). The CDF is best for threshold questions, percentile computations, tail probabilities (( P(X > c) = 1 - F_X(c) )), and for comparing or transforming random variables.
Every PMF on a discrete support satisfies two simple rules: [ p_X(x) \ge 0 \text{ for all } x, \qquad \sum_x p_X(x) = 1. ] Every CDF is non-decreasing, right-continuous, satisfies ( \lim_{x \to -\infty} F_X(x) = 0 ) and ( \lim_{x \to \infty} F_X(x) = 1 ), and has jumps at exactly the points where ( p_X > 0 ). The jump size at ( x ) is ( p_X(x) ).
A function of a random variable is another random variable: if ( X ) is random and ( g ) is a deterministic function, then ( Y = g(X) ) is random. The distribution of ( Y ) is obtained by pulling back along ( g ): ( p_Y(y) = \sum_{x : g(x) = y} p_X(x) ). Transformations often change the support shape -- ( Y = X^2 ) on ( X \in {-2, -1, 0, 1, 2} ) has support ( {0, 1, 4} ), not ( {0, 1, 2, 4} ).
Why It Matters Here
Many later ideas depend on choosing the right representation:
- PMFs are best for exact-value computations and for expectations: ( E[X] = \sum_x x, p_X(x) ).
- CDFs are best for threshold and tail questions -- exactly the kind SRE and SLO work cares about ("what fraction of requests take more than 200 ms?").
- Transformed variables reuse these ideas when you define something like ( Y = g(X) ), whether ( g ) is a scaling for units (( Y = 1000 X ), seconds -> milliseconds), an aggregation (( Y = \max(X_1, \dots, X_n) )), or a thresholding (( Y = \mathbb{1}[X > c] )).
If you only memorize PMF formulas, you miss the structural picture: any question you can ask about a random variable -- probability at a point, probability in a range, probability in a tail, expectation, variance, likelihood under a model -- reduces to some operation on the PMF or CDF.
In practice, engineers work more with empirical CDFs (from observed data, via sorting) than with parametric PMFs, because production data rarely fits a named distribution exactly. The operational habit "sort your latencies and read off the percentiles" is the CDF in disguise.
Concrete Examples
Example 1 -- Three fair coin flips. Let ( X ) be the number of heads in 3 independent fair flips. The support is ( {0, 1, 2, 3} ), and by counting: [ P(X = k) = \binom{3}{k} \cdot \frac{1}{8}, \qquad k = 0, 1, 2, 3. ] So [ p_X(0) = \tfrac{1}{8},\ p_X(1) = \tfrac{3}{8},\ p_X(2) = \tfrac{3}{8},\ p_X(3) = \tfrac{1}{8}. ] The CDF accumulates: [ F_X(0) = \tfrac{1}{8},\ F_X(1) = \tfrac{4}{8},\ F_X(2) = \tfrac{7}{8},\ F_X(3) = 1. ] If you want ( P(X \le 1) ), the CDF gives it immediately as ( F_X(1) = 1/2 ). To compute ( P(X \ge 2) ), use ( 1 - F_X(1) = 1/2 ), not ( p_X(2) ), which would miss the ( X = 3 ) case.
Example 2 -- Transformation: squared residual. Let ( X ) take values in ( {-2, -1, 0, 1, 2} ) uniformly (each with probability ( 1/5 )). Define ( Y = X^2 ). The support of ( Y ) is ( {0, 1, 4} ), and by pulling back: [ p_Y(0) = p_X(0) = \tfrac{1}{5},\quad p_Y(1) = p_X(-1) + p_X(1) = \tfrac{2}{5},\quad p_Y(4) = p_X(-2) + p_X(2) = \tfrac{2}{5}. ] Notice the support has shrunk (5 values -> 3 values) and the probabilities have been combined across pre-images. You cannot simply "apply ( g ) to the support" and keep the same PMF shape -- the pre-image structure matters. A common bug: treating ( Y = X^2 ) as if ( Y ) were uniform on ( {0, 1, 4} ). It is not.
Common Confusion / Misconceptions
"PMF and CDF are interchangeable notations for the same thing." They contain the same information but answer different questions and appear in different formulas. Always use the representation that matches the question.
"( P(X = 1) ) equals ( P(X \le 1) )." The first is a single mass; the second is a cumulative probability including all smaller values. For discrete variables they are different unless the only value in the support up to 1 is exactly 1.
"Transforming ( X ) by ( g ) keeps the same PMF shape." No. The support changes and probabilities can combine across pre-images. You must carefully list ( {x : g(x) = y} ) for each ( y ) in the new support.
"A PMF value ( p_X(x) ) can exceed 1." No -- it is a probability. Contrast with a continuous density ( f_X(x) ), which can exceed 1 (Cluster 5). The distinction matters.
"The CDF is always continuous." Discrete variables have CDFs with jumps at each support point; between support points they are flat. Right-continuity is the convention (left limits give you the "strictly less than" probability).
How To Use It
When analyzing a discrete random variable:
- List the support -- the set of values with nonzero probability.
- Compute PMF values over the support, ensuring they sum to 1 as a sanity check.
- Accumulate into the CDF if threshold or tail questions matter.
- For a transform ( Y = g(X) ): enumerate ( {x : g(x) = y} ) for each ( y ), then compute ( p_Y(y) = \sum_{x : g(x) = y} p_X(x) ).
- For a scaling or shift (( Y = aX + b )), the PMF values stay the same but the support shifts and scales.
- For a monotone transform of a continuous variable (Cluster 5), use ( F_Y(y) = F_X(g^{-1}(y)) ).
- Sanity-check with ( F_X(-\infty) = 0 ), ( F_X(\infty) = 1 ), and with a simulation.
Transfer / Where This Shows Up Later
- Semester 2 (Algorithms). Running-time distributions of randomized algorithms are usually analyzed via CDFs. The "with high probability" statements ("quicksort takes ( O(n \log n) ) time with probability ( \ge 1 - 1/n )") are tail-probability statements using the CDF.
- Semester 2 (Hashing). "Max bucket load" has a complicated CDF; you almost always bound the tail rather than computing it exactly.
- Semester 5 (Queueing). The Erlang distribution and other waiting-time distributions are given as CDFs because the engineer always asks threshold questions ("what fraction of waits exceed 1 second?").
- Semester 6 (Distributed). Replication lag is quantified by the CDF of the lag random variable, often reported as percentiles (p50 lag, p99 lag). The CDF is the native language.
- Semester 8 (SLOs). The 99th percentile latency is ( F_X^{-1}(0.99) ) -- the inverse CDF at 0.99. Every SLO is an inverse-CDF statement. Error budgets are computed from the tail ( 1 - F ).
- Semester 9 (Experimentation). The Kolmogorov-Smirnov test compares two distributions by computing the maximum difference between their CDFs. Non-parametric A/B testing frequently uses CDF comparisons.
Check Yourself
- What does the CDF tell you that the PMF does not? Give one question that is natural for each.
- Why must PMF values sum to 1?
- What changes in PMF and CDF when you define ( Y = 2X + 1 )? What stays the same?
- For a fair six-sided die, write the PMF, then the CDF, then compute ( P(2 < X \le 5) ) using each.
- Give an example of a transformation ( g ) such that the number of values in the support of ( Y = g(X) ) is strictly less than that of ( X ).
- The CDF of an integer-valued variable is a step function. What is the jump size at integer ( k )? Why is the CDF right-continuous by convention?
Mini Drill or Application
Let ( X ) be the number of successes in 2 fair coin flips.
- Write the PMF.
- Write the CDF.
- Define ( Y = X^2 ). Write the PMF of ( Y ).
- Compute ( P(Y = 1) ) and ( P(Y \le 1) ).
Now let ( X ) be the number of requests that fail out of 5, with per-request failure probability 0.1.
- Write the PMF (this is a binomial distribution -- preview of the next concept).
- Compute ( P(X \ge 2) ) exactly using the CDF approach.
- Define ( Y = \mathbb{1}[X \ge 1] ). What is its PMF? (This is the "at least one failure" indicator -- a common derived random variable in monitoring.)
Simulation drill. Implement Example 1 in Python by simulating 100,000 triples of fair coin flips. Compute the empirical PMF and CDF and compare to the analytic values. Then define ( Y = X^2 ) and compute the empirical PMF of ( Y ); compare to the pulled-back calculation. This is the microcosm of every Monte Carlo workflow: simulate a primitive random variable, transform, aggregate, and read off the statistic of interest.
Read This Only If Stuck
- Introduction to Probability: Distributions and PMFs (Part 1)
- Introduction to Probability: Distributions and PMFs (Part 2)
- Introduction to Probability: Cumulative distribution functions
- Introduction to Probability: Functions of random variables (Part 1)
- Introduction to Probability: Functions of random variables (Part 2)
- MCS: Distribution Functions (Part 1)
- Wikipedia: Probability mass function -- cross-reference for the formal definition and the comparison with PDFs.
- Wikipedia: Cumulative distribution function -- useful for the right-continuity convention and percentile inversion.