End-to-End Tests: The Expensive Few

What This Concept Is

An end-to-end (E2E) test drives the system the way a real user or client does: through the UI or through the public API, against a deployed stack that is as close to production as you can afford. It is the most realistic and the most expensive kind of test you run.

E2E tests:

start the whole system (often in staging or a production-equivalent environment);
traverse every layer the walking skeleton defined;
assert user-observable outcomes, not internals;
run in seconds to minutes, not milliseconds;
are the most likely to be flaky and the most painful to debug.

Because they are expensive, you keep few of them -- and the ones you keep must protect behavior that matters enough to justify the cost. Google's "Just Say No to More End-to-End Tests" post is a widely cited industry warning about what happens when a team lets E2E coverage grow unchecked: pipelines slow down, retries get added to hide flakes, real failures get lost in the noise, and the suite eventually stops informing anyone.

Why It Matters Here (In the Capstone)

The opinionated split this module recommends for a capstone is roughly 70% unit, 25% integration, 5% end-to-end. The 5% is not an arbitrary number; it reflects two realities:

E2E tests are the only tests that prove the system actually works as a product;
they are also the place where flakiness, cost, and maintenance are highest.

If you try to run 30% of your tests at the E2E level, your CI becomes slow, flaky, and disliked. If you run 0%, you have no evidence the system works as deployed. The 5% is a ceiling, not a floor: fewer E2E tests are fine if the ones you keep cover the critical paths.

Rationale for the 70/25/5 split

Unit tests (70%): cheapest, fastest, biggest target -- they protect logic.
Integration tests (25%): expensive enough to limit, cheap enough to use generously at seams. They catch the majority of "my code meets the environment" bugs.
E2E tests (5%): reserve for flows the business or the user cares about: one happy path per critical feature, plus one "the system survives an outage" smoke test.

This is not dogma -- a mobile app with complex UI might need more E2E coverage, a CLI tool with no UI might need less. It is the default to depart from with an argument, not to exceed by accident.

Concrete Example(s) -- from a real capstone

In the task-manager capstone:

a single E2E test that signs in a test user, creates a task via the real API, and sees it in the list (the "money path");
a second E2E test that exercises the sync endpoint against a staging GitHub account, confirming one real issue becomes one task;
a final smoke test that asserts GET /health returns 200 in the deployed staging environment after each deploy.

Three tests. That is roughly 5% of a suite of about 60 tests (42 unit + 15 integration + 3 E2E).
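The arithmetic above can be kept honest with a small check. The following is a minimal sketch, not part of the capstone codebase: `SuiteCounts`, `shares`, and `e2e_within_budget` are hypothetical names, and the counts are the ones quoted above. The comparison uses integer arithmetic so that exactly 5% counts as within the ceiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteCounts:
    """Test counts per layer (hypothetical helper, not a real library type)."""

    unit: int
    integration: int
    e2e: int

    @property
    def total(self) -> int:
        return self.unit + self.integration + self.e2e


def shares(counts: SuiteCounts) -> dict:
    """Return each layer's share of the suite as a fraction of all tests."""
    if counts.total == 0:
        raise ValueError("suite has no tests")
    return {
        "unit": counts.unit / counts.total,
        "integration": counts.integration / counts.total,
        "e2e": counts.e2e / counts.total,
    }


def e2e_within_budget(counts: SuiteCounts, ceiling_percent: int = 5) -> bool:
    """True when the E2E share is at or below the ceiling.

    Compares e2e/total <= ceiling_percent/100 without floating point.
    """
    if counts.total == 0:
        raise ValueError("suite has no tests")
    return counts.e2e * 100 <= ceiling_percent * counts.total


# The task-manager numbers from the text: 42 unit + 15 integration + 3 E2E.
capstone = SuiteCounts(unit=42, integration=15, e2e=3)
```

With these counts `e2e_within_budget(capstone)` holds; adding a fourth E2E test without growing the lower layers would push the share past the ceiling.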

Each E2E test must have a real owner. If it breaks, someone knows immediately and triages. If it gets flaky, it is disabled with a ticket, not left broken in CI. The smoke test runs after every deploy; if it ever fails, the deploy is rolled back automatically (see M04 for the operational discipline).
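The post-deploy smoke test can be very small. The sketch below uses only the Python standard library; the function name `health_ok` and the default timeout are assumptions for illustration, and the only detail taken from the text is that GET /health must return 200.

```python
import urllib.request


def health_ok(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True only if GET {base_url}/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # urllib raises HTTPError for 4xx/5xx responses and URLError for
        # connection problems; both are OSError subclasses, as are socket
        # timeouts. Any of them means the deployment is not healthy.
        return False
```

A deploy job could call this with the staging base URL and exit non-zero when it returns False, which is one way to give the automatic rollback something to react to.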

Common Confusion / Misconceptions

The first misconception is that more E2E coverage is better. It is not. Each additional E2E test costs the suite some health: slower feedback, more flakiness, more "retry it and hope" behavior. The correct move is to push coverage downward whenever possible, which is the same direction Google's post argues for: let unit tests carry the bulk of the coverage.

The second is assuming an E2E test replaces integration tests. It does not. E2E tests are so expensive that every specific behavior you verify there should also be covered by a faster test at a lower level where reasonable.

The third is confusing E2E with "runs against production." An E2E test does not have to run against production; it runs against a production-equivalent environment that has the same wiring as production.

The fourth is using a retry flag to silence an intermittent E2E failure. Auto-retries hide non-determinism; a retried E2E test is a test you no longer trust.
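A back-of-the-envelope calculation shows how much a retry flag can hide. This is a sketch under a simplifying assumption: each attempt fails independently with the same probability, which real flaky tests only approximate. The function name is hypothetical.

```python
def red_run_probability(failure_rate: float, attempts: int) -> float:
    """Chance that a pipeline run ends red when a test gets `attempts` tries.

    Assumes every attempt fails independently with probability `failure_rate`;
    the run is red only if all attempts fail.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return failure_rate ** attempts


# A test that fails one run in five because of a real race condition:
without_retry = red_run_probability(0.2, attempts=1)  # 20% of runs are red
with_retry = red_run_probability(0.2, attempts=3)     # under 1% of runs are red
```

Under that assumption the race is still there on every fifth execution, but the pipeline reports it on fewer than one run in a hundred, so the signal that something is non-deterministic has almost disappeared.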

How To Use It (In Your Capstone)

When adding an E2E test, ask:

Is this protecting a path a user or buyer actually cares about?
Is it something an integration test could not cover just as well at one-tenth the cost?
Do I have a plan for the day it goes flaky?
Is the value worth more than the cost this test adds to every pipeline run?
Does adding it come with deleting or rewriting an older E2E test, so the budget stays bounded?
Does it live in a separate pipeline stage so it cannot block integration or unit feedback?
Is it parameterised by environment so the same test can run against staging and as a production smoke check?

If any answer is no, do not add it.
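Parameterising by environment can be as simple as resolving a target from an environment variable. The sketch below is illustrative only: the variable name `E2E_TARGET`, the `Target` type, and both URLs are assumptions, not values from the capstone. It also encodes one possible policy, namely that production only ever receives smoke checks.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Target:
    name: str
    base_url: str
    smoke_only: bool  # True: only smoke checks may run against this target


# Hypothetical URLs; replace with the real staging and production hosts.
TARGETS = {
    "staging": Target("staging", "https://staging.example.com", smoke_only=False),
    "production": Target("production", "https://app.example.com", smoke_only=True),
}


def resolve_target(environ: Optional[Mapping[str, str]] = None) -> Target:
    """Pick the target named by E2E_TARGET (assumed name), defaulting to staging."""
    env = os.environ if environ is None else environ
    name = env.get("E2E_TARGET", "staging")
    if name not in TARGETS:
        raise ValueError(f"unknown E2E target: {name!r}")
    return TARGETS[name]


def should_run(test_kind: str, target: Target) -> bool:
    """Smoke checks run everywhere; full flows run only on non-smoke-only targets."""
    return test_kind == "smoke" or not target.smoke_only
```

Defaulting to staging means a forgotten variable never points a state-changing flow at production.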

When to Depart from 70/25/5

UI-heavy capstone: a browser app with complex flows justifies more E2E coverage (maybe 10%), but push the added cost into a nightly pipeline stage.
Pure library or CLI with no UI: E2E can drop to 1-2%. The "critical flow" is a command-line invocation.
Data pipeline capstone: one E2E happy-path run per scheduled execution replaces interactive E2E.
Multi-service system: E2E across services is extremely expensive. Prefer contract tests (Concept 9) for cross-service coverage and keep E2E for one happy path per public feature.

The Quarantine Playbook

When an E2E test goes flaky (and they do):

Disable from the blocking pipeline the same day. Do not retry-and-hope.
Open a ticket with logs attached and a triage entry (Concept 10).
Move to a separate "flaky" lane that runs but does not block merges.
Fix within one sprint or delete. Long-running quarantine is abandonment.
Reinstate only after 50-100 consecutive green runs in the flaky lane.
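The re-entry rule in the last step is easy to automate once the flaky lane records pass/fail per run. A minimal sketch, assuming the run history is available as a list of booleans ordered oldest to newest; the function names and the default of 50 (the lower end of the range above) are illustrative choices.

```python
from typing import Sequence


def trailing_green_streak(results: Sequence[bool]) -> int:
    """Count consecutive passing runs at the end of the history (True = green)."""
    streak = 0
    for passed in reversed(results):
        if not passed:
            break
        streak += 1
    return streak


def ready_to_reinstate(results: Sequence[bool], required_streak: int = 50) -> bool:
    """True once the most recent runs form a green streak of the required length."""
    return trailing_green_streak(results) >= required_streak
```

Only the trailing streak counts: a single red run in the flaky lane resets the clock, no matter how many green runs came before it.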

Anti-Patterns to Recognize

E2E as primary safety net. Every test is E2E "to be sure." CI runs stretch to 30 minutes.
Duplicated coverage. A feature has an E2E test that asserts only what an integration test already proves, so the E2E adds cost without adding evidence.
Silent retry. The pipeline auto-retries E2E failures three times, so real failures are hidden.
Orphan E2E test. Nobody owns or understands the test; every fix is "add a retry."

Check Yourself

Why is the E2E layer the narrowest band of the pyramid?
When does a feature truly need an E2E test, given that integration tests exist?
What is the default response when an E2E test goes flaky, and why is retry not a fix?
What disqualifies an E2E test from staying in the blocking pipeline?
How does your capstone's shape (UI-heavy, CLI, data pipeline, multi-service) change the 70/25/5 default?

Mini Drill or Application (Capstone-scoped)

List the three flows in your capstone that would hurt the most if broken in production. For each, decide: E2E, integration-only, or already covered.
Implement one E2E test that runs against staging. Pin it to a named test user.
Configure your CI so that integration failures block merges but E2E failures block only the deploy stage.
Draft the quarantine playbook entry in library/raw/ci.md: who owns a flaky E2E, where it goes, and the re-entry criteria.
At the end of the semester, audit: did your E2E count stay at or below 5% of total tests? If not, which test should have stayed at the integration level?

Source Backbone

Capstone implementation applies earlier code-quality, testing, and refactoring material. These books are the source backbone for that practice.

Software Engineering at Google - testing, review, and engineering-process backbone.
Refactoring - safe change and behavior-preserving improvement.
Good Code, Bad Code - maintainability and code-quality judgment.
Clean Code - readability and function-level craft support.
