Test Strategy in CI: Unit, Integration, Contract, Smoke
What This Concept Is
A CI pipeline runs several kinds of tests, each with a specific job:
- Unit tests. A function or class in isolation. No network, no filesystem, no database. Milliseconds per test. Thousands of them.
- Integration tests. Real collaborators wired together: the service plus a real Postgres, a real message broker, a real HTTP server. Seconds per test. Tens to hundreds.
- Contract tests. Assert that a producer and consumer agree on the shape of an API or event. Producer-side: "my responses match this contract." Consumer-side: "I only depend on fields in this contract." Tens of tests, run in CI for both sides.
- Smoke tests. A handful of high-value sanity checks run against the deployed service. "Can I log in? Can I POST an order? Does the health endpoint return 200?" Seconds total. Five to twenty tests.
- End-to-end (E2E) tests. Cross-service critical-path flows, usually through a real UI. Minutes each. Tens -- not hundreds -- in a healthy suite.
Each layer answers a different question. Mixing their jobs is the main source of slow, flaky, or misleading pipelines. The shape of the set is the test pyramid: lots of cheap unit tests at the base, fewer integration tests in the middle, a few E2E / smoke at the top.
Why It Matters Here
The CI pipeline is the team's feedback loop. Its value is proportional to how fast it gives a trustworthy verdict:
- Fast. Developers wait for CI. 30-minute pipelines destroy flow and encourage people to batch commits.
- Trustworthy. A flaky pipeline trains the team to ignore red. Ignored red is worse than no tests.
You get fast and trustworthy by putting each kind of test at the stage where it is cheapest and most informative. The Linux Command Line chapter on defensive programming and testing gives the baseline shell-script habits (set -euo pipefail, trap on exit, short assertions in a script) that also show up in every CI runner.
Concrete Example
A well-shaped pipeline:
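As a minimal sketch, the ordering can be written as a list of stages run cheapest first, stopping at the first failure. The stage names and the `run_pipeline` helper are illustrative assumptions, not any CI system's API; each stage is a plain callable standing in for a CI job.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure (fail fast).

    Returns the names of the stages that actually ran.
    """
    ran = []
    for name, step in stages:
        ran.append(name)
        if not step():
            break  # later, more expensive stages never start
    return ran


def example_stages(unit_ok: bool = True) -> List[Stage]:
    # Each lambda stands in for a real CI job (assumption for illustration).
    return [
        ("lint", lambda: True),
        ("unit", lambda: unit_ok),
        ("build", lambda: True),
        ("integration", lambda: True),
        ("contract", lambda: True),
        ("deploy-staging", lambda: True),
        ("smoke-staging", lambda: True),
        ("deploy-prod", lambda: True),
        ("smoke-prod", lambda: True),
    ]
```

With a failing unit stage, `run_pipeline(example_stages(unit_ok=False))` runs only `lint` and `unit`; the build never starts.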
Note:
- Unit tests run before the build. If the logic itself is wrong, there is no point building a container.
- Integration tests run after the build, against the artifact, with real services spun up via testcontainers / docker-compose / service containers.
- Smoke tests run after each deploy. They are the first line of defense against misconfigured environments.
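A smoke suite can be small enough to read on one screen. The sketch below assumes hypothetical `/health` and `/login` paths; the base URL is a parameter so the same checks can run against local, staging, or production. The `fetch` parameter is an assumption added so the checks can be exercised without a network.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def run_smoke(base_url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Run each check against base_url; True means the check passed."""
    base = base_url.rstrip("/")
    # Paths are hypothetical; replace them with the service's critical paths.
    checks = {
        "health returns 200": base + "/health",
        "login page loads": base + "/login",
    }
    return {name: fetch(url) == 200 for name, url in checks.items()}
```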
Costs climb by roughly two orders of magnitude per layer: a unit test is close to free, an integration test costs about 100x more, and an end-to-end smoke test about another 100x. Put 80% of coverage in the cheapest layer.
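To make the cost argument concrete, the sketch below totals serial runtime for a hypothetical suite. The per-test durations and test counts are illustrative assumptions chosen to match the rough 100x steps, not measurements.

```python
# Assumed per-test cost in seconds and assumed test counts per layer.
LAYERS = {
    "unit": {"seconds_each": 0.005, "count": 2000},
    "integration": {"seconds_each": 0.5, "count": 100},
    "e2e": {"seconds_each": 50.0, "count": 10},
}


def layer_totals(layers):
    """Total serial runtime in seconds for each layer."""
    return {
        name: spec["seconds_each"] * spec["count"]
        for name, spec in layers.items()
    }


totals = layer_totals(LAYERS)
for name, seconds in totals.items():
    print(f"{name:12s} {seconds:7.1f} s")
```

Under these assumptions, 10 E2E tests consume about 50 times the runtime of 2,000 unit tests.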
Flaky Tests -- A Lifecycle Policy
Flakes are the single most corrosive pathology in a CI system. The Google Testing Blog post "Flaky Tests at Google and How We Mitigate Them" (2016, still widely cited) reports that almost 16% of Google's tests showed some level of flakiness. Without a policy, flake rates climb until developers ignore red builds.
A usable three-state lifecycle:
- Green. Test passes reliably. Gates merges.
- Quarantined. Test fails > 1% of runs. Moved to a nightly suite within 24 hours. Does not gate merges. Owner is named on the test.
- Retired. Quarantine lasts at most two weeks. By then the test is either fixed and returned to Green, or deleted. No "we'll come back to it."
Rule of thumb: the main-pipeline flake rate must stay below 0.5% of runs. Above that, retries-until-green stops being a workaround and becomes the actual test strategy, which means you have no test strategy.
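The lifecycle can be expressed as a small decision function. This is a sketch using the thresholds stated above (1% per-test flake rate, two weeks in quarantine, 0.5% pipeline flake rate); the `FlakeRecord` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlakeRecord:
    runs: int                                  # total executions observed
    flaky_failures: int                        # failures with no code change behind them
    days_in_quarantine: Optional[int] = None   # None means not quarantined


def lifecycle_state(rec: FlakeRecord) -> str:
    """Return 'green', 'quarantined', or 'retired' for one test."""
    if rec.days_in_quarantine is not None:
        # Two weeks is the limit: fix it or delete it.
        return "retired" if rec.days_in_quarantine >= 14 else "quarantined"
    rate = rec.flaky_failures / rec.runs if rec.runs else 0.0
    return "quarantined" if rate > 0.01 else "green"


def pipeline_within_budget(pipeline_runs: int, flaky_runs: int) -> bool:
    """True if the main-pipeline flake rate is below 0.5% of runs."""
    return pipeline_runs > 0 and flaky_runs / pipeline_runs < 0.005
```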
Common Confusion / Misconception
"Integration tests can replace unit tests." No. Integration tests are too slow and too coarse for the feedback loop. A function with 20 branches needs 20 unit tests, not one expensive integration test that happens to exercise it.
"End-to-end tests are the most reliable." They are the least. An E2E test depends on dozens of moving parts; any of them can flake. Teams that lean heavily on E2E drift toward a red pipeline they ignore. Use E2E sparingly as smoke tests of critical paths. Google's internal guidance and Martin Fowler's TestPyramid both say the same thing: E2E is expensive, flakes worst, and should be the thinnest slice.
"Contract tests are just extra integration tests." Contract tests are versioned agreements. They run on both sides of an integration independently, so the producer and consumer can evolve without a joint test environment. Losing that property loses the value. Consumer-driven contract testing (Pact) is the most common implementation: the consumer writes the expected contract, the producer verifies it, no shared staging required.
"Flaky tests should be retried." Retries hide bugs. A flaky test is a broken test -- either the test is wrong, or the code has a real race condition. Quarantine flakes to a nightly job; do not retry-until-green in the main pipeline.
"Coverage percentage is the metric." Coverage measures lines executed, not behaviors verified. 80% coverage with only happy-path tests is meaningfully worse than 60% coverage with targeted edge cases. Use coverage as a floor check ("this module has no tests at all"), not a target.
How To Use It
Shape the test pyramid per stage:
| Stage | Types | Budget | Purpose |
|---|---|---|---|
| Pre-build | lint, static checks, unit | < 2 min | catch most defects cheaply |
| Post-build | integration, contract | < 10 min | verify wiring with real collaborators |
| Post-deploy (staging) | smoke, E2E critical-path | < 2 min | environment sanity |
| Post-deploy (prod) | smoke only | < 1 min | confirm the deploy actually works |
Hard rules:
- No test layer above unit may be the only coverage for a piece of logic.
- No flaky test may gate merges. Fix or quarantine within 24 hours.
- Smoke tests must be runnable by hand, locally, against any environment. They are the on-call team's first tool.
- Parallelize aggressively -- sharding by file or time bucket can turn a 20-minute suite into a 3-minute one on a modest runner fleet.
- Fail fast: run the cheapest stages first; one failing lint saves a 5-minute integration run.
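Sharding by time bucket can be sketched as greedy bin packing: sort tests by recorded duration, longest first, and give each one to the shard with the least work so far. `shard_by_duration` is a hypothetical helper, not a feature of any particular test runner, and it assumes you already have per-test timings from earlier runs.

```python
import heapq
from typing import Dict, List


def shard_by_duration(durations: Dict[str, float], shards: int) -> List[List[str]]:
    """Assign tests to shards so total runtime per shard is roughly even."""
    # Heap entries are (total seconds assigned so far, shard index).
    heap = [(0.0, i) for i in range(shards)]
    heapq.heapify(heap)
    plan: List[List[str]] = [[] for _ in range(shards)]
    # Longest tests first; ties broken by name for a deterministic plan.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, secs in ordered:
        total, idx = heapq.heappop(heap)   # currently lightest shard
        plan[idx].append(name)
        heapq.heappush(heap, (total + secs, idx))
    return plan
```

For example, five tests taking 60, 50, 40, 30, and 20 seconds split across two shards as 110 s and 90 s, instead of 200 s run serially.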
Check Yourself
- Give one example of logic that belongs in unit tests, one in integration tests, and one in contract tests.
- Why do smoke tests run after deploy and not before?
- What is the difference between a contract test and a consumer-driven contract?
- Why is "retry flaky tests three times" usually the wrong fix?
- Above what main-pipeline flake rate do retries stop being a workaround and become the actual strategy?
Mini Drill or Application
Take a real service's test suite. Classify every test as unit, integration, contract, smoke, or E2E. Plot:
- what fraction is at each layer (by count and by runtime)
- which tests are flaky (fail > 1% of runs)
- which layer has the most flakes
- the pipeline's p95 duration, compared to the budget table above
Most teams discover their pyramid is upside down (too much E2E, not enough unit) or that 20% of their tests take 80% of their CI time. Both findings are actionable.
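One way to start the drill, assuming you can export per-test records from your CI system. The record fields (`layer`, `seconds`, `runs`, `failures`) and the sample data are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical export: one dict per test.
TESTS = [
    {"name": "test_parse", "layer": "unit", "seconds": 0.01, "runs": 500, "failures": 0},
    {"name": "test_total", "layer": "unit", "seconds": 0.02, "runs": 500, "failures": 0},
    {"name": "test_db_roundtrip", "layer": "integration", "seconds": 3.0, "runs": 500, "failures": 2},
    {"name": "test_checkout_ui", "layer": "e2e", "seconds": 90.0, "runs": 500, "failures": 12},
]


def summarize(tests, flaky_threshold=0.01):
    """Per layer: share of test count, share of runtime, number of flaky tests."""
    count = defaultdict(int)
    runtime = defaultdict(float)
    flaky = defaultdict(int)
    for t in tests:
        count[t["layer"]] += 1
        runtime[t["layer"]] += t["seconds"]
        if t["runs"] and t["failures"] / t["runs"] > flaky_threshold:
            flaky[t["layer"]] += 1
    total_n = len(tests)
    total_s = sum(runtime.values())
    return {
        layer: {
            "count_share": count[layer] / total_n,
            "runtime_share": runtime[layer] / total_s if total_s else 0.0,
            "flaky": flaky[layer],
        }
        for layer in count
    }
```

In this made-up sample, the single E2E test is a quarter of the test count but takes nearly all of the runtime, and it is the only test above the 1% flake threshold.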
Read This Only If Stuck
- The Linux Command Line: Defensive programming and testing
- The Linux Command Line: Test cases and examining values during execution
See also (external)
- Martin Fowler: TestPyramid -- the foundational piece on test shape
- Martin Fowler: ContractTest -- definition and rationale
- Martin Fowler: EradicatingNonDeterminism -- the essay on flaky tests
- Pact -- consumer-driven contract testing -- canonical tool, good docs
- Testcontainers -- real-service integration tests without a shared environment
- Google Testing Blog -- Flaky Tests at Google -- data on flake rates at scale