Test Strategy in CI: Unit, Integration, Contract, Smoke

What This Concept Is

A CI pipeline runs several kinds of tests, each with a specific job:

Unit tests. A function or class in isolation. No network, no filesystem, no database. Milliseconds per test. Thousands of them.
Integration tests. Real collaborators wired together: the service plus a real Postgres, a real message broker, a real HTTP server. Seconds per test. Tens to hundreds.
Contract tests. Assert that a producer and consumer agree on the shape of an API or event. Producer-side: "my responses match this contract." Consumer-side: "I only depend on fields in this contract." Tens of tests, run in CI for both sides.
Smoke tests. A handful of high-value sanity checks run against the deployed service. "Can I log in? Can I POST an order? Does the health endpoint return 200?" Seconds total. Five to twenty tests.
End-to-end (E2E) tests. Cross-service critical-path flows, usually through a real UI. Minutes each. Tens -- not hundreds -- in a healthy suite.

Each layer answers a different question. Mixing their jobs is the main source of slow, flaky, or misleading pipelines. Together the layers form the test pyramid: many cheap unit tests at the base, fewer integration and contract tests in the middle, and a few E2E and smoke tests at the top.
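The two sides of a contract test described above can be sketched in plain Python. This is a minimal illustration, not the Pact API: the contract `ORDER_CONTRACT_V1`, its fields, and the helpers `producer_violations` and `consumer_violations` are all hypothetical names invented for the example.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical versioned contract for an "order" response: field name -> type.
ORDER_CONTRACT_V1: Dict[str, type] = {
    "id": str,
    "status": str,
    "total_cents": int,
}


def producer_violations(response: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Producer side: "my responses match this contract."

    Extra fields in the response are allowed; missing or mistyped ones are not.
    """
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


def consumer_violations(fields_read: Iterable[str], contract: Dict[str, type]) -> List[str]:
    """Consumer side: "I only depend on fields in this contract."

    Returns the fields the consumer reads that the contract does not promise.
    """
    return sorted(f for f in fields_read if f not in contract)


if __name__ == "__main__":
    response = {"id": "o-1", "status": "paid", "total_cents": 1250, "note": "gift"}
    print(producer_violations(response, ORDER_CONTRACT_V1))
    print(consumer_violations(["id", "discount_code"], ORDER_CONTRACT_V1))
```

Each function can run in its own repository's CI against the shared contract file, which is what lets the two sides evolve without a joint test environment.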

Why It Matters Here

The CI pipeline is the team's feedback loop. Its value is proportional to how fast it gives a trustworthy verdict:

Fast. Developers wait for CI. 30-minute pipelines destroy flow and encourage people to batch commits.
Trustworthy. A flaky pipeline trains the team to ignore red. Ignored red is worse than no tests.

You get fast and trustworthy by putting each kind of test at the stage where it is cheapest and most informative. The Linux Command Line chapter on defensive programming and testing gives the baseline shell-script habits (set -euo pipefail, trap on exit, short assertions in a script) that also show up in every CI runner.

Concrete Example

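A well-shaped pipeline runs its stages in this order: lint and unit tests, build, integration and contract tests, deploy, smoke tests.

Note: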

Unit tests run before the build. If the logic itself is broken, there is no point building a container.
Integration tests run after the build, against the artifact, with real services spun up via testcontainers / docker-compose / service containers.
Smoke tests run after each deploy. They are the first line of defense against misconfigured environments.
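The stage ordering in the notes above can be sketched as a small fail-fast runner. This is an illustrative model only: the `Stage` type, the stage names, and the always-passing stage bodies are assumptions, and a real pipeline would be expressed in the CI system's own configuration and would shell out to the test runner, the build, and the deploy tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage passes


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure (fail fast).

    Returns the overall verdict and the names of the stages that were started.
    """
    executed: List[str] = []
    for stage in stages:
        executed.append(stage.name)
        if not stage.run():
            return False, executed
    return True, executed


# Hypothetical stage bodies: each lambda stands in for a real command.
PIPELINE = [
    Stage("lint + unit", lambda: True),
    Stage("build artifact", lambda: True),
    Stage("integration + contract", lambda: True),
    Stage("deploy to staging", lambda: True),
    Stage("smoke (staging)", lambda: True),
]

if __name__ == "__main__":
    ok, executed = run_pipeline(PIPELINE)
    print("passed" if ok else "failed", executed)
```

Because the cheapest stage comes first, a unit-test failure means the build, the integration run, and the deploy never start.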

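The costs climb by roughly two orders of magnitude per layer: a unit test is nearly free, an integration test costs about 100x more, and an end-to-end or smoke test against a deployed environment costs about 100x more again. Put 80% of coverage in the cheapest layer.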

Flaky Tests -- A Lifecycle Policy

Flakes are the single most corrosive pathology in a CI system. Google's published "Flaky Tests at Google and How We Mitigate Them" (2016, still widely cited) documents that ~16% of their tests exhibited some flakiness over time. Without a policy, flake rates climb until developers ignore red builds.

A usable three-state lifecycle:

Green. Test passes reliably. Gates merges.
Quarantined. Test fails intermittently on more than 1% of runs. Moved to a nightly suite within 24 hours of detection. Does not gate merges. Owner is named on the test.
Retired. A test gets at most two weeks in quarantine. By then it is either fixed and returned to Green, or deleted. No "we'll come back to it."

Rule of thumb: the main-pipeline flake rate must stay below 0.5% of runs. Above that, retries-until-green stops being a workaround and becomes the actual test strategy, which means you have no test strategy.
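The lifecycle and the retry problem can both be made concrete with a short sketch. The thresholds come from the policy above; everything else (the `FlakeRecord` type, the state names returned by `next_state`, and the assumption that a quarantined test's counts cover only runs since its last fix attempt) is hypothetical. The retry arithmetic assumes independent runs, which real flakes do not always satisfy.

```python
from dataclasses import dataclass
from typing import Optional

QUARANTINE_THRESHOLD = 0.01  # a test failing on more than 1% of runs is quarantined
QUARANTINE_LIMIT_DAYS = 14   # two weeks to fix, then delete


@dataclass
class FlakeRecord:
    name: str
    runs: int
    failures: int  # intermittent failures; for quarantined tests, since the last fix attempt (assumption)
    days_in_quarantine: Optional[int] = None  # None means the test currently gates merges

    @property
    def failure_rate(self) -> float:
        return self.failures / self.runs if self.runs else 0.0


def next_state(rec: FlakeRecord) -> str:
    """Return "green", "quarantined", or "delete" for one test."""
    flaky = rec.failure_rate > QUARANTINE_THRESHOLD
    if rec.days_in_quarantine is None:
        return "quarantined" if flaky else "green"
    if not flaky:
        return "green"  # fixed: goes back to gating merges
    if rec.days_in_quarantine >= QUARANTINE_LIMIT_DAYS:
        return "delete"
    return "quarantined"


def red_after_retries(p_fail: float, attempts: int) -> float:
    """Chance a test still reports red when retried until green, up to `attempts` runs."""
    return p_fail ** attempts


if __name__ == "__main__":
    print(next_state(FlakeRecord("test_checkout", runs=1000, failures=20)))
    print(f"{red_after_retries(0.3, 3):.3f}")
```

Under the independence assumption, a test that fails 30% of the time shows red on under 3% of pipeline runs once it is retried three times, which is how retries make a badly broken test look merely annoying.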

Common Confusion / Misconception

"Integration tests can replace unit tests." No. Integration tests are too slow and too coarse for the feedback loop. A function with 20 branches needs 20 unit tests, not one expensive integration test that happens to exercise it.

"End-to-end tests are the most reliable." They are the least. An E2E test depends on dozens of moving parts; any of them can flake. Teams that lean heavily on E2E drift toward a red pipeline they ignore. Use E2E sparingly as smoke tests of critical paths. Google's internal guidance and Martin Fowler's TestPyramid both say the same thing: E2E is expensive, flakes worst, and should be the thinnest slice.

"Contract tests are just extra integration tests." Contract tests are versioned agreements. They run on both sides of an integration independently, so the producer and consumer can evolve without a joint test environment. Losing that property loses the value. Consumer-driven contract testing (Pact) is the most common implementation: the consumer writes the expected contract, the producer verifies it, no shared staging required.

"Flaky tests should be retried." Retries hide bugs. A flaky test is a broken test -- either the test is wrong, or the code has a real race condition. Quarantine flakes to a nightly job; do not retry-until-green in the main pipeline.

"Coverage percentage is the metric." Coverage measures lines executed, not behaviors verified. 80% coverage with only happy-path tests is meaningfully worse than 60% coverage with targeted edge cases. Use coverage as a floor check ("this module has no tests at all"), not a target.

How To Use It

Shape the test pyramid per stage:

Stage	Types	Budget	Purpose
Pre-build	lint, static checks, unit	< 2 min	catch 80% of defects cheaply
Post-build	integration, contract	< 10 min	verify wiring with real collaborators
Post-deploy (staging)	smoke, E2E critical-path	< 2 min	environment sanity
Post-deploy (prod)	smoke only	< 1 min	confirm the deploy actually works

Hard rules:

No test layer above unit may be the only coverage for a piece of logic.
No flaky test may gate merges. Fix or quarantine within 24 hours.
Smoke tests must be runnable by hand, locally, against any environment. They are the on-call team's first tool.
Parallelize aggressively -- sharding by file or time bucket turns a 20-minute suite into a 3-minute one on a modest runner fleet.
Fail fast: run the cheapest stages first; one failing lint saves a 5-minute integration run.
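Sharding by time bucket, mentioned in the rules above, can be sketched as a greedy assignment: take test files from slowest to fastest and give each one to the shard with the least work so far. The function name `shard_by_duration` and the sample durations are assumptions for illustration; real runners usually feed in timings recorded from earlier runs.

```python
import heapq
from typing import Dict, List


def shard_by_duration(durations: Dict[str, float], shard_count: int) -> List[List[str]]:
    """Split test files into shards with roughly equal total runtime.

    Greedy longest-first heuristic: not guaranteed optimal, but simple and
    usually close enough for CI.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    shards: List[List[str]] = [[] for _ in range(shard_count)]
    # Min-heap of (total seconds assigned so far, shard index).
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    # Slowest files first; ties broken by name so the result is deterministic.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return shards


if __name__ == "__main__":
    # Hypothetical per-file runtimes in seconds.
    timings = {"a": 300, "b": 200, "c": 200, "d": 100, "e": 100}
    for shard in shard_by_duration(timings, 3):
        print(shard, sum(timings[name] for name in shard))
```

Sharding by file alone ignores runtime, so one slow file can dominate a shard; bucketing by recorded time is what keeps the slowest shard, and therefore the whole stage, short.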

Check Yourself

Give one example of logic that belongs in unit tests, one in integration tests, and one in contract tests.
Why do smoke tests run after deploy and not before?
What is the difference between a contract test and a consumer-driven contract?
Why is "retry flaky tests three times" usually the wrong fix?
What rate of flakes in the main pipeline is the threshold above which retries become the actual strategy?

Mini Drill or Application

Take a real service's test suite. Classify every test as unit, integration, contract, smoke, or E2E. Plot:

what fraction is at each layer (by count and by runtime)
which tests are flaky (fail > 1% of runs)
which layer has the most flakes
the pipeline's p95 duration, compared to the budget table above

Most teams discover their pyramid is upside down (too much E2E, not enough unit) or that 20% of their tests take 80% of their CI time. Both findings are actionable.
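The measurements in the drill can be computed with a few helpers once each test is classified. This is a sketch under assumptions: the `SuiteEntry` record, its field names, and the nearest-rank definition of p95 are choices made for the example, and the inputs would come from your own CI history.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class SuiteEntry:
    name: str
    layer: str      # "unit", "integration", "contract", "smoke", or "e2e"
    seconds: float  # typical runtime of this test
    runs: int
    failures: int   # intermittent failures across those runs


def layer_fractions(entries: Sequence[SuiteEntry]) -> Dict[str, Tuple[float, float]]:
    """Map each layer to (fraction of tests by count, fraction of total runtime)."""
    counts: Dict[str, int] = defaultdict(int)
    runtime: Dict[str, float] = defaultdict(float)
    for entry in entries:
        counts[entry.layer] += 1
        runtime[entry.layer] += entry.seconds
    total_runtime = sum(runtime.values())
    return {
        layer: (
            counts[layer] / len(entries),
            runtime[layer] / total_runtime if total_runtime else 0.0,
        )
        for layer in counts
    }


def flaky_tests(entries: Sequence[SuiteEntry], threshold: float = 0.01) -> List[str]:
    """Names of tests that fail on more than `threshold` of their runs."""
    return [e.name for e in entries if e.runs and e.failures / e.runs > threshold]


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, e.g. of pipeline durations."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]
```

Comparing the count fraction with the runtime fraction for each layer is usually the quickest way to see an upside-down pyramid: a layer with a small share of the tests and a large share of the runtime is where the budget is going.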
