Independence Is Information Neutrality, Not Disjointness

What This Concept Is

Two events ( A ) and ( B ) are independent when learning one does not change the probability of the other; for ( P(B) > 0 ), [ P(A \mid B) = P(A) \quad \Longleftrightarrow \quad P(A \cap B) = P(A)\,P(B). ] Independence is information neutrality: evidence that ( B ) occurred gives you zero evidence about whether ( A ) occurred. Notice how sharply this differs from disjointness: two events are disjoint if ( A \cap B = \emptyset ), which means learning that ( B ) happened tells you ( A ) definitely did not -- that is maximum informational dependence, not independence.

When three or more events are involved, two distinct notions appear:

  • Pairwise independence: every pair ( (A_i, A_j) ) is independent.
  • Mutual independence: every subfamily factors, i.e., ( P\!\left(\bigcap_{i \in I} A_i\right) = \prod_{i \in I} P(A_i) ) for every finite index set ( I ) with at least two elements.

Mutual independence is strictly stronger. A famous counterexample: toss two fair coins; let ( A_1 = \{\text{first is H}\} ), ( A_2 = \{\text{second is H}\} ), ( A_3 = \{\text{both same}\} ). Each pair is independent, but all three together are not -- knowing any two determines the third.
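
The two-coin counterexample is small enough to check by brute force. The following is a minimal sketch that enumerates the four equally likely outcomes with exact fractions; the helper names (prob, both) are illustrative and not part of the course material.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin tosses, four equally likely outcomes.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def both(e, f):
    """Intersection of two events."""
    return lambda w: e(w) and f(w)

a1 = lambda w: w[0] == "H"     # first toss is heads
a2 = lambda w: w[1] == "H"     # second toss is heads
a3 = lambda w: w[0] == w[1]    # both tosses show the same face

# Pairwise check: every pair must satisfy the product rule.
pairs = [(a1, a2), (a1, a3), (a2, a3)]
pairwise_ok = all(prob(both(e, f)) == prob(e) * prob(f) for e, f in pairs)

# Mutual check: the triple intersection must also factor.
p_all = prob(lambda w: a1(w) and a2(w) and a3(w))
p_product = prob(a1) * prob(a2) * prob(a3)

print(pairwise_ok)         # True
print(p_all, p_product)    # 1/4 1/8
```

The triple intersection has probability 1/4 while the product of the marginals is 1/8, so the factorization required for mutual independence fails even though every pair passes.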

There is also conditional independence: ( A ) and ( B ) are conditionally independent given ( C ) when ( P(A \cap B \mid C) = P(A \mid C) P(B \mid C) ). Two events can be dependent unconditionally but independent given a third event that explains their correlation -- or vice versa. This is the math behind "controlling for a variable."

Why It Matters Here

Independence drives simplification. Without it, almost nothing factors. With it:

  • product-rule probability: ( P(A \cap B) = P(A) P(B) )
  • repeated trials (Bernoulli, binomial)
  • variance of a sum is the sum of variances (Cluster 4)
  • Chernoff and Hoeffding bounds (Semester 2's randomized-algorithm analysis)
  • the central limit theorem (Cluster 5)

If you confuse independence with disjointness, nearly every later model breaks. And if you silently assume independence where components are in fact positively correlated, your variance bounds understate the real spread -- a failure mode that leads to severe underestimation of tail risk in distributed systems.

This concept is quiet but load-bearing: it is the reason Bloom filters, universal hashing, the analysis of randomized consensus protocols, and the math of correlated failures all have the structure they do.

An operational framing: independence is an assumption budget. Every independence claim buys you algebra (products, variance sums, exponential tail bounds) but also incurs a liability -- if the assumption is wrong, all downstream numbers are wrong in the same direction. Engineers who treat independence as a modeling choice to be justified, not a default to be assumed, avoid a large class of production surprises.

Concrete Examples

Example 1 -- Fair coin flips (independent). Toss a fair coin twice. Let ( A ) = "first flip is H" and ( B ) = "second flip is H." Then [ P(A) = P(B) = \tfrac{1}{2}, \qquad P(A \cap B) = \tfrac{1}{4} = \tfrac{1}{2} \cdot \tfrac{1}{2}. ] The product rule holds, so ( A ) and ( B ) are independent.

Now compare with ( C ) = "exactly one H" and ( D ) = "two H's." ( P(C) = 1/2 ), ( P(D) = 1/4 ), but ( P(C \cap D) = 0 \neq (1/2)(1/4) ). So ( C ) and ( D ) are disjoint but not independent -- in fact they are maximally informative about each other: knowing ( C ) occurred tells you ( D ) did not.
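
As a sanity check on Example 1, here is a small sketch that classifies a pair of events on the two-coin sample space as disjoint and/or independent. The function name classify is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT, equally likely

def prob(event):
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def classify(e, f):
    """Return (is_disjoint, is_independent) for two events."""
    p_joint = prob(lambda w: e(w) and f(w))
    return p_joint == 0, p_joint == prob(e) * prob(f)

A = lambda w: w[0] == "H"          # first flip is H
B = lambda w: w[1] == "H"          # second flip is H
C = lambda w: w.count("H") == 1    # exactly one H
D = lambda w: w.count("H") == 2    # two H's

print(classify(A, B))   # (False, True): independent, not disjoint
print(classify(C, D))   # (True, False): disjoint, not independent
```

The two flags never come out (True, True) for events of positive probability, which is the point of the example.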

Example 2 -- Correlated server failures. A shared power supply feeds two servers. Individually, each server fails on a given day with probability ( 0.01 ). Without the shared power, the probability both fail is ( (0.01)^2 = 10^{-4} ). With the shared power -- say it fails 0.1% of days and causes joint failure whenever it does -- the true joint probability is [ P(\text{both fail}) \ge P(\text{power fail}) = 10^{-3}, ] at least a factor of 10 higher than the independence estimate. This is the correlated-failure fallacy that shows up in every real-world redundancy analysis: assuming independence across components sharing power, network, cooling, or human operators systematically understates joint failure. RAID vendors have lost reputation by publishing MTBF numbers that assumed disk failures were independent; they are not, especially after the first disk in a batch fails.
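
To put a number on Example 2, one can write the joint failure probability as a two-part mixture. The sketch below rests on an assumed model that the text does not spell out: when the shared supply is healthy, the servers fail independently with a residual probability chosen so that each server's overall failure probability stays at 0.01. The function name joint_failure is illustrative.

```python
def joint_failure(p_marginal, q_shared):
    """P(both servers fail) under an assumed common-cause mixture model.

    With probability q_shared the shared supply fails and both servers go
    down. Otherwise each server fails independently with a residual
    probability, picked so the per-server marginal equals p_marginal.
    """
    p_residual = (p_marginal - q_shared) / (1 - q_shared)
    return q_shared + (1 - q_shared) * p_residual ** 2

p, q = 0.01, 0.001          # numbers from Example 2
naive = p ** 2              # estimate that assumes independence
actual = joint_failure(p, q)

print(f"independence estimate: {naive:.2e}")
print(f"mixture model:         {actual:.2e}")
print(f"ratio:                 {actual / naive:.1f}x")
```

Under these assumptions the shared-supply term contributes almost all of the joint failure probability; the independent residual part adds less than a tenth on top of it.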

Example 3 -- Conditional independence via a common cause. Let ( A ) = "service A latency is high today" and ( B ) = "service B latency is high today." Marginally, ( A ) and ( B ) are correlated, because both are affected by a shared dependency ( C ) = "database is slow today." But given ( C ) (and given "not ( C )"), the two services are effectively independent: once you know the DB state, knowing A's latency tells you nothing more about B. In math: [ P(A \cap B) \neq P(A)\,P(B), \quad \text{but} \quad P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C). ] This is why a lurking common cause can make two variables look dependent when they are actually conditionally independent; controlling for the common cause restores factorization. The pattern appears constantly in observability: any two metrics that depend on a common upstream look correlated until you slice by that upstream.
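
A numeric instance makes the common-cause pattern concrete. The probabilities below are made-up illustrative values, not measurements: the database is slow with probability 1/5, and each service has high latency with probability 4/5 when it is slow and 1/10 otherwise. Conditional independence given the database state is built into the model.

```python
from fractions import Fraction as F

# Illustrative numbers (assumptions, not from the text).
p_c = F(1, 5)                  # P(C): database is slow
p_high = {True: F(4, 5),       # P(latency high | C)
          False: F(1, 10)}     # P(latency high | not C)

def weight(c):
    """Probability of the database state c."""
    return p_c if c else 1 - p_c

# Marginals via the law of total probability.
p_a = sum(weight(c) * p_high[c] for c in (True, False))
p_b = p_a  # both services respond to C in the same way

# Joint: A and B are independent within each database state.
p_ab = sum(weight(c) * p_high[c] ** 2 for c in (True, False))

print(p_ab, p_a * p_b)   # 17/125 36/625
```

Marginally the joint probability (0.136) is more than twice the product of the marginals (0.0576), yet within each database state the product rule holds exactly.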

Common Confusion / Misconceptions

"Disjoint means independent." Exactly backwards. Disjoint positive-probability events are never independent: learning one occurred tells you the other did not. Independence is no-information; disjointness is total-information.

"Independent pairs imply independent triples." No. Pairwise independence is strictly weaker than mutual independence. The Bloom filter analysis depends specifically on mutual independence of the hash functions -- pairwise is not enough.

"Sampling without replacement preserves independence." No. Drawing two cards without replacement makes the second draw dependent on the first. This is why hypergeometric and binomial differ (Cluster 3 concept 9).
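
"Zero covariance implies independence." Independence implies zero covariance, but not vice versa. Two random variables can have zero covariance even when one is a deterministic function of the other: if ( X ) is uniform on ( \{-1, 0, 1\} ) and ( Y = X^2 ), then ( \operatorname{Cov}(X, Y) = 0 ) although ( Y ) is completely determined by ( X ). This gets formalized in Cluster 4 on covariance.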

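A standard counterexample for the zero-covariance misconception takes ( X ) uniform on ( \{-1, 0, 1\} ) and ( Y = X^2 ). The sketch below verifies it with exact arithmetic; the helper name expect is illustrative.

```python
from fractions import Fraction

xs = [-1, 0, 1]            # support of X
p = Fraction(1, 3)         # X is uniform on xs; Y is defined as X**2

def expect(g):
    """Expectation of g(X)."""
    return sum(p * g(x) for x in xs)

# Cov(X, Y) = E[X * Y] - E[X] * E[Y] with Y = X**2.
cov = expect(lambda x: x * x ** 2) - expect(lambda x: x) * expect(lambda x: x ** 2)

# Independence would require P(X = 0, Y = 0) = P(X = 0) * P(Y = 0).
p_joint = sum(p for x in xs if x == 0 and x ** 2 == 0)
p_x0 = sum(p for x in xs if x == 0)
p_y0 = sum(p for x in xs if x ** 2 == 0)

print(cov)                     # 0
print(p_joint, p_x0 * p_y0)    # 1/3 1/9
```

The covariance vanishes, but the product rule fails for the events "X = 0" and "Y = 0", so the variables are dependent.
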
"Conditional independence is the same as independence." No. ( A ) and ( B ) can be independent but conditionally dependent given some ( C ) (Berkson's paradox), or dependent but conditionally independent given ( C ) (Simpson-type behavior). The causal structure matters.

"Independence composes freely." A common slip: "the two replicas are independent, and the two regions are independent, so the whole system is independent." No -- independence is not transitive, and independence of pairs does not license arbitrary composition. Each joint distribution has to be checked in its own right.

"Two samples from the same distribution are independent." Not automatically -- identically distributed is not the same as independent. The phrase "i.i.d." exists precisely because you have to state both.

How To Use It

To test independence:

  1. Compute ( P(A) ) and ( P(B) ). Or use known distributions if the variables come from a parametric family.
  2. Compute ( P(A \cap B) ) directly, if possible.
  3. Compare with the product ( P(A) P(B) ). Independence holds iff equality holds.
  4. Alternatively, check ( P(A \mid B) = P(A) ). Often easier verbally: does knowing ( B ) change your belief about ( A )?
  5. In multistage problems, ask whether the condition changes future probabilities. If it does, independence is gone.
  6. For three or more, check all subfamilies unless the problem explicitly states mutual independence (as most well-modeled problems do).
  7. Before assuming independence in a real system, enumerate shared components: power, network, cooling, human operators, shared libraries, shared DNS. Any shared failure domain kills independence.
  8. Write the assumption down. "We assume replica failures are independent, conditional on no shared-infra outage." That framing makes the assumption audit-able -- someone can later test whether it is actually true in production data.
  9. Test empirically when stakes are high. Pull historical outage data; count joint failures; compare to what independence would predict. A gap of 10x or more almost always means a hidden common cause.
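
Steps 1 to 3 and step 9 can be combined into one empirical check: estimate the marginal and joint failure rates from history and compare the joint rate with the product. The following is a sketch only; the function name co_failure_lift and the 100-day history are made up for illustration, and with so few failures a real estimate would be noisy.

```python
def co_failure_lift(a_failed, b_failed):
    """Observed joint-failure rate divided by the independence prediction.

    a_failed, b_failed: equal-length sequences of booleans, one entry per
    observation window (for example, per day). Returns None when the
    independence prediction is zero.
    """
    if len(a_failed) != len(b_failed) or not a_failed:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a_failed)
    p_a = sum(a_failed) / n
    p_b = sum(b_failed) / n
    p_joint = sum(1 for a, b in zip(a_failed, b_failed) if a and b) / n
    predicted = p_a * p_b
    return None if predicted == 0 else p_joint / predicted

# Made-up history: 100 days, each service fails on 4 days, 3 days overlap.
a = [day in {3, 17, 42, 80} for day in range(100)]
b = [day in {3, 17, 42, 91} for day in range(100)]

lift = co_failure_lift(a, b)
print(f"lift = {lift:.2f}")   # about 18.75, far above 1
```

A lift near 1 is what independence predicts; a value this large points to a shared cause worth naming and modeling explicitly.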

Transfer / Where This Shows Up Later

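  • Semester 2 (Algorithms). Universal hashing needs only a bound on the collision probability of each pair of keys; pairwise-independent (strongly universal) hash families provide that bound and also support variance-based concentration via Chebyshev. Stronger guarantees (e.g., Chernoff-style tail bounds) require higher-order, k-wise independence.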
  • Semester 2 (Randomized algorithms). Chernoff and Hoeffding bounds require independence (or at least negative association) to give exponential tail bounds. Without it, you get only the much weaker Chebyshev bound.
  • Semester 5 (Systems). M/M/1 queue analysis assumes memorylessness (a continuous-time independence property). Violating it -- e.g., with bursty arrivals -- changes the steady-state formulas substantially.
  • Semester 6 (Distributed systems). Availability math like "99.99% available" for a replicated system assumes independent replica failures. Real failure distributions are heavily correlated by shared infrastructure, which is why the actual availability falls short of the theoretical formula.
  • Semester 8 (Error budgets). If a user request touches ( k ) services with independent error probabilities ( p_i ), the request error rate is ( 1 - \prod(1-p_i) ). Most incidents come from failures of this independence assumption: a shared dependency goes down, and error rates across multiple services spike together.
  • Semester 9 (Experiments). A/B test statistical validity assumes users' outcomes are independent. Shared cache, shared feature flag behavior, or clustered assignment (e.g., assigning by team, not user) breaks this: the effective sample size is smaller than the nominal one, so confidence intervals come out too narrow and significance is overstated.
  • Semester 9 (Observability). Many anomaly detectors flag joint metric deviations as "independent signals" and multiply their p-values. If the metrics are correlated through a common cause, the combined "signal" is spurious; this is a frequent source of false-positive alerts.
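
The error-budget formula from the list above, ( 1 - \prod(1-p_i) ), is easy to compute, and a one-line extension shows how a shared dependency changes the picture. This is a sketch under an assumed mixture model: the shared dependency is down with probability q_shared and fails every request, and otherwise the services err independently. Function names and numbers are illustrative.

```python
import math

def request_error_rate(error_probs):
    """Request error rate when the per-service errors are independent."""
    return 1 - math.prod(1 - p for p in error_probs)

def request_error_rate_shared(error_probs, q_shared):
    """Same request, plus an assumed shared dependency.

    With probability q_shared the shared dependency is down and the
    request fails; otherwise the independent formula applies.
    """
    return q_shared + (1 - q_shared) * request_error_rate(error_probs)

services = [0.001] * 5      # five services, 0.1% error rate each (made up)
independent = request_error_rate(services)
with_shared = request_error_rate_shared(services, q_shared=0.002)

print(f"independent services:   {independent:.5f}")
print(f"with shared dependency: {with_shared:.5f}")
```

Even a shared dependency that is down only 0.2% of the time adds noticeably to the request error rate, and it does so for every service that touches it at the same moment.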

Check Yourself

  1. Why can disjoint positive-probability events not be independent? State the proof in one line.
  2. What does independence say in words, without formulas?
  3. Why does sampling without replacement usually destroy independence?
  4. Construct three events that are pairwise independent but not mutually independent (hint: two-coin example above).
  5. Two services each fail independently with probability ( p ). A third service shares a dependency with both. Model the joint failure probability carefully. How much does the shared dependency matter?
  6. A Bloom filter uses ( k ) hash functions on a single key; the false-positive probability computation assumes the hash outputs are mutually independent. Where would this assumption fail in a real implementation?
  7. Give an example of two events that are independent marginally but become dependent once you condition on a third event. (Hint: selection / explaining-away / Berkson.)
  8. Two services have individual failure probability ( 0.005 ). Under an independence assumption the joint failure probability is ( 2.5 \times 10^{-5} ). Production data shows joint failures happen at about ( 5 \times 10^{-4} ). What is the minimum probability of a shared cause that explains this?
  9. If ( X \sim \text{Bernoulli}(p) ) with ( 0 < p < 1 ) and ( Y = 1 - X ), compute ( \operatorname{Cov}(X, Y) ). Are ( X ) and ( Y ) independent? (This exercises the direction that does hold: independence implies zero covariance, so a nonzero covariance rules independence out.)
  10. In a Bloom filter implementation, someone replaces the ( k ) independent hash functions with ( k ) deterministic transformations of a single hash output (e.g., bit slices). Under what insertion pattern does this still behave approximately independently, and when does it break?

Mini Drill or Application

For each pair, decide whether the events are likely to be disjoint, independent, neither, or impossible to tell without more modeling. Justify in one sentence:

  1. First die is even; second die is even (two fair rolls).
  2. A card is a heart; a card is a king (one draw).
  3. First draw is red; second draw is red, without replacement.
  4. Server A fails today; server B fails today after a shared power event.
  5. User clicks "Like"; user shares the post (model these as correlated through interest in the content).

Diagnostic checklist before assuming independence. Tick off:

  • Are the two events driven by the same hardware or shared tenancy? If yes, suspect correlation.
  • Do they share a human operator or deploy pipeline? Shared humans are a frequent hidden dependency.
  • Is the sampling "with replacement" (independent-friendly) or "without replacement" (dependence)?
  • Do historical co-occurrences happen more often than the product rule predicts? Ratio ( \ge 2 ) is a red flag.
  • If any check above raises a flag, replace independence with conditional independence given the shared factor, and model the shared factor explicitly.

Engineering scenario -- availability math. Three replicas, each with 99% availability. If replicas were independent, system availability would be ( 1 - (0.01)^3 = 99.9999% ) ("six nines"). In a more realistic model, a shared rack switch is down 0.5% of the time and takes all three replicas with it. The effective availability is then at most ( 1 - 0.005 = 99.5% ). The math looks like six nines only because the independence assumption was unexamined; the shared-switch term dominates. Fix: model the failure as a mixture -- with small probability, all three go down; otherwise, failures are independent -- and plan capacity and SLOs around the larger failure probability.
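
The mixture model in this scenario can be checked with a short Monte Carlo run. The sketch below makes simplifying assumptions: the switch is down in a fraction q_shared of time slots, and when it is up each replica fails independently with probability p_fail. The function name, trial count, and seed are arbitrary choices for illustration.

```python
import random

def simulate_availability(n_replicas=3, p_fail=0.01, q_shared=0.005,
                          trials=200_000, seed=0):
    """Estimate the fraction of time slots with at least one replica up."""
    rng = random.Random(seed)
    up = 0
    for _ in range(trials):
        if rng.random() < q_shared:
            continue  # shared switch down: every replica is unreachable
        # Switch is up: replicas fail independently; one survivor suffices.
        if any(rng.random() >= p_fail for _ in range(n_replicas)):
            up += 1
    return up / trials

independent_formula = 1 - 0.01 ** 3
estimate = simulate_availability()

print(f"independence formula: {independent_formula:.6f}")
print(f"simulated mixture:    {estimate:.4f}")
```

The simulated value sits near 99.5%, not six nines, because almost every system outage in the run comes from the shared switch and not from three coincident replica failures.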

Simulation drill -- the Bloom filter. A Bloom filter with ( m = 1000 ) bits and ( k = 5 ) hash functions inserts ( n = 300 ) items. Under the independence-of-hash-outputs assumption, the false-positive probability is approximately ( (1 - e^{-kn/m})^k ). Implement this Bloom filter in Python using a pseudo-random hash family, insert 300 random items, and then query 10,000 random items not in the set. What fraction are false positives? Compare to the analytic estimate. If your hash family fails to produce independent-looking outputs (e.g., you use a bad hash), you will see the false-positive rate deviate.
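
One possible starting point for the drill is sketched below. It is not a reference solution: it stands in k salted SHA-256 digests for the k hash functions (an assumption about what counts as a reasonable pseudo-random hash family), and it uses deterministic string keys in place of random items so that runs are repeatable.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter with m bits and k salted hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)   # one byte per bit, for clarity

    def _positions(self, item):
        # Salting SHA-256 with the index i yields k different hash values.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

m, k, n, queries = 1000, 5, 300, 10_000
bf = BloomFilter(m, k)
for i in range(n):
    bf.add(f"member-{i}")

# Query keys that were never inserted; every hit is a false positive.
false_positives = sum(f"absent-{j}" in bf for j in range(queries))
observed = false_positives / queries
analytic = (1 - math.exp(-k * n / m)) ** k

print(f"analytic estimate: {analytic:.3f}")
print(f"observed rate:     {observed:.3f}")
```

To see the independence assumption fail, try replacing _positions with something weaker, for example positions that are all small offsets of a single hash value, and compare the observed rate with the analytic one again.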

Read This Only If Stuck