Skip to main content

Flaky Test Diagnosis and Recovery

What This Concept Is

A flaky test is a test that passes on some runs and fails on others against the same code. It destroys trust in the test suite: once the habit of rerunning is formed, real failures also get reruns, and regressions slip through.

A flaky test has one underlying cause: non-determinism. Finding it is a narrow diagnostic problem. Recovery is a three-step loop:

  1. Reproduce the flake deterministically enough to study.
  2. Diagnose the non-determinism against a short list of common causes.
  3. Fix the cause (not the symptom) and let the test earn its way back into the suite.

Google's own published data on flakes (see references) is blunt: even at scale, the fix is never "retry harder"; it is to track flakes as first-class bugs, quarantine them, and either fix or delete them on a clock. The industry rule of thumb that "every flaky test is a bug, often in the production code" has held up well because the test is a witness, not a culprit.

Why It Matters Here (In the Capstone)

Capstone CI already runs slower than you would like. Once a flaky test shows up, every pipeline gets a retry and every retry costs both time and attention. A capstone with a flaky CI turns into one where nobody reads red builds, which is functionally equivalent to having no tests at all.

This module requires you to be able to diagnose a flaky test against five common causes and apply a targeted fix.

Concrete Example(s) -- from a real capstone

Flaky report: an E2E test that creates a task and asserts the list includes it fails about 1 in 20 runs with "task not found."

Step 1 -- Reproduce: run the test 50 times locally. Maybe two failures. Inspect logs from a failing run against a passing run.

Step 2 -- Diagnose: go through the five common causes in order.

Five common causes of flakiness

  1. Timing and race conditions -- the assertion runs before the side effect has committed. Common in async code, eventual-consistency paths, and UI tests.
  2. Shared mutable state between tests -- a previous test left a row in the DB, a global variable, or a cached connection that influences this test.
  3. External dependency non-determinism -- the real external API, network latency, clock skew, or DNS.
  4. Test-order dependency -- test B passes only when test A runs first (or fails only when A runs first).
  5. Environment leakage -- local runs pass because of env vars, filesystem state, or seeded data that CI does not have (or vice versa).

For each candidate cause, there is a specific diagnostic step:

  • for timing: add tracing around the side effect; look for a completion signal before the assertion runs.
  • for shared state: run the suspect test alone; run in random order.
  • for external deps: replace the external with a deterministic fake and see if flakiness disappears.
  • for test order: reverse or randomize order and see if failures track a specific ordering.
  • for environment leakage: run in a clean container with CI-like env and compare.

In the example above, random ordering finds that a previous test leaves a stale DB row. Shared state is the real cause. The fix is to make the test's DB setup clean per-test, not per-suite.

Step 3 -- Fix: apply the targeted fix, then rerun 100 times with random order to earn the test's way back into the suite.

Common Confusion / Misconceptions

The most expensive confusion is treating retries as a fix. A test that is marked "auto-retry up to 3 times" is a test you no longer trust. The information that the test provides about the system degrades with every retry. Retries are triage, not a solution.

The second confusion is assuming flakiness means the test is bad. Sometimes the test is correct and the production code is the one being flaky (a real race condition in your code), and the test is correctly reporting it. Diagnose before blaming the test.

The third is trying to silence flakiness by loosening assertions (assert x >= 0 instead of assert x == expected). That converts a test that sometimes fails into a test that can never fail, which is worse.

The fourth is quarantining silently with @pytest.mark.skip or equivalent and moving on. A skip without a ticket is a test that has been deleted without the honesty of admitting it.

How To Use It (In Your Capstone)

When a test flakes:

  1. Tag it as flaky and quarantine it -- do not let it gate the pipeline.
  2. Open a triage entry (Concept 10): Sev-2 or Sev-3 depending on impact.
  3. Reproduce with rerun loops and randomized ordering.
  4. Go through the five-cause checklist in order and apply the matching diagnostic.
  5. Apply a targeted fix at the cause, not the symptom.
  6. Prove the fix by running the test 100+ times with random order in CI before unquarantining.
  7. If unresolved after two sprints, delete the test with a ledger entry explaining why.

Quarantine is temporary. A flaky test that sits quarantined for more than a sprint is a decision to delete it or to rewrite it at a different level.

A Diagnostic Script for a Capstone

  1. Run the test 100 times locally. If it never fails locally, the non-determinism is CI-environment-specific. Jump to step 5.
  2. Run the test alone (no suite around it). If it passes consistently alone, you have shared state or order dependency.
  3. Randomize test order in the suite and rerun. If failures cluster around a specific other test running first, you have found the culprit.
  4. Introduce logging around the side effect and the assertion. Compare timestamps. If the assertion runs before the side effect completes, it is timing.
  5. Compare local env and CI env byte-for-byte. Env vars, filesystem paths, timezone, locale, container base image, connection pool size. Differences here often explain CI-only failures.
  6. Quarantine the external. Replace the real external call with a deterministic fake and rerun. If flakiness disappears, the external is the source.
  7. If all of the above fail, the test may be observing a real non-determinism in production code (a real race). The flakiness is correct reporting of a real bug.

Write the outcome of each step in the triage entry. A complete flake investigation is a genuine artefact -- share it in self-review (Concept 14).

Quarantine Protocol

  • Quarantine within 24 hours of the second failure.
  • Owner assigned with a dated commitment.
  • No retry flags. A quarantined test runs in a non-blocking lane. If it keeps failing, that is useful signal.
  • Delete if unresolved in two sprints. An unfixable flaky test has negative value. A deleted test with a ledger entry is more honest.

Anti-Patterns to Recognize

  • pytest-rerunfailures @ 3. Auto-retry hides real non-determinism.
  • Loosened assertion. Changing assert x == expected to assert x is not None to stop the flakiness.
  • Silent quarantine. A test is @pytest.mark.skip with no ticket and no owner.
  • Environment-only fix. Fix the CI env so the test passes, but never diagnose what about the test was CI-sensitive.

See also (integrative)

External references:

Check Yourself

  1. What does it mean to "fix the cause, not the symptom" for a flaky test?
  2. Why is auto-retry a red flag rather than a fix?
  3. List the five common causes of flakiness and give one diagnostic step for each.
  4. Why is "loosen the assertion" a worse fix than quarantine plus diagnose?
  5. If the diagnostic script exhausts all five causes and the flake remains, what is the most likely explanation and who should own the investigation?

Mini Drill or Application (Capstone-scoped)

  1. Pick one test in your capstone -- or introduce a deliberately flaky one -- and run it 100 times with random ordering.
  2. For each failure, write which of the five causes you suspect, the diagnostic step you applied, and the fix you shipped.
  3. Produce a one-page diagnosis report suitable for sharing with a teammate; include the triage entry id.
  4. Prove the fix with 100 randomized runs in CI and unquarantine the test with a commit message explaining the fix.
  5. Write a short paragraph in your capstone journal on which of the five causes tends to dominate in your codebase and what class of defense would reduce it further.

Source Backbone

Capstone implementation applies earlier code-quality, testing, and refactoring material. These books are the source backbone for that practice.