Smoke Tests After Each Deploy
What This Concept Is
A smoke test is a small, real check that runs against a real environment after a deploy, not before a merge. It answers one question: "did the thing we just deployed actually start serving traffic correctly?"
Three checks are enough for a capstone:
- Liveness: the process answers an HTTP request on its health endpoint with 200. Confirms the container started, ports are bound, the runtime is alive.
- Readiness with a real dependency: an endpoint that exercises the database (and other critical dependencies) returns the expected shape -- not just 200, but a JSON body with "db":"ok". Confirms wiring.
- Critical user path: one authenticated flow that represents a core feature (e.g., POST /login + GET /me). Confirms end-to-end business behavior against real data.
These are not unit tests. They run against the deployed environment, production included. They are cheap to run, broad in what they cover, and cheap to investigate when they give a wrong signal. Google SRE calls this category "production probes" and treats it as a first-class reliability tool, distinct from unit/integration testing.
Why It Matters Here (In the Capstone)
Smoke tests are the evidence that a deploy succeeded in reality, not just in CI. They are what turns "my pipeline went green" into "users can use the system." Every subsequent piece of this module -- rollback triggers, release notes, runbook -- depends on smoke tests existing.
Without post-deploy smoke, a deploy is a hope. Env vars could be missing, the new binary could fail to connect to the DB because of a TLS setting, the rewrite rule for a new endpoint could be wrong, or DNS might still point at the blue revision. Unit and integration tests in CI cannot catch any of those; only a smoke test against the real environment can.
Concrete Example(s)
A smoke script that runs as the last pipeline step:
#!/usr/bin/env bash
set -euo pipefail
BASE_URL="${1:-https://api.capstone.example.com}"
TIMEOUT=10
# 1. Liveness
curl -fsSL --max-time "$TIMEOUT" "$BASE_URL/healthz" > /dev/null
echo "[ok] liveness"
# 2. Readiness with DB
READY=$(curl -fsSL --max-time "$TIMEOUT" "$BASE_URL/readyz")
echo "$READY" | grep -q '"db":"ok"' || { echo "[fail] db not ok: $READY"; exit 2; }
echo "[ok] readiness with db"
# 3. Critical path: login smoke user, fetch /me
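# Build the JSON with jq so a password containing quotes or backslashes stays valid JSON
TOKEN=$(curl -fsSL --max-time "$TIMEOUT" -X POST "$BASE_URL/login" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg pw "$SMOKE_PW" '{email: "smoke@capstone.test", password: $pw}')" \
  | jq -r .token)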
[[ -n "$TOKEN" && "$TOKEN" != "null" ]] || { echo "[fail] login returned no token"; exit 3; }
ME=$(curl -fsSL --max-time "$TIMEOUT" "$BASE_URL/me" -H "Authorization: Bearer $TOKEN")
echo "$ME" | grep -q '"email":"smoke@capstone.test"' || { echo "[fail] /me wrong shape"; exit 4; }
echo "[ok] critical path"
echo "SMOKE PASS"
In the workflow:
- name: Smoke test
  env:
    SMOKE_PW: ${{ secrets.SMOKE_PW }}
  run: ./scripts/smoke.sh https://api.capstone.example.com
  timeout-minutes: 3
If any check fails, the deploy job fails, which triggers the rollback path from concept 10. For a canary or blue/green deploy, the smoke also gates traffic cut-over:
./scripts/smoke.sh "$CANARY_URL" && gcloud run services update-traffic capstone-api --to-latest
A readiness endpoint that actually reports its truth:
app.get("/readyz", async (_req, res) => {
  const db = await checkDb().catch((e) => e.message);
  const cache = await checkCache().catch(() => "unavailable");
  const ok = db === "ok";
  res.status(ok ? 200 : 503).json({ db, cache });
});
Common Confusion / Misconceptions
- "Smoke tests are a subset of integration tests." No. Integration tests run in CI against a local or staging service, with fixtures. Smoke tests run against prod (or the environment you just deployed) with real secrets and real infrastructure. The purpose is different: integration tests catch bugs in code; smoke tests catch bugs in deployment.
- "/healthz returning 200 is enough." /healthz often only checks that the process is alive. It does not check that the new code is wired correctly to the database, the secret store, or the external dependencies. You need at least one check that exercises the real data path.
- "Smoke tests should run against every endpoint." No -- that is a full end-to-end regression suite, not a smoke test. Smoke covers the 2-3 paths whose failure would be catastrophic; everything else is caught by monitoring and alerts within minutes, not seconds.
- "A smoke failure is a rollback." Usually, but not always. Distinguish between "the deploy failed" (rollback) and "the smoke user's password expired" (fix smoke, re-run). The smoke script's exit code should be trustworthy; a flaky smoke erodes the entire discipline.
- "Smoke can live outside the pipeline." It can, but it should also run inside the pipeline as the final deploy step. External probes catch ongoing outages; in-pipeline smoke catches bad deploys before traffic shifts.
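The rollback-versus-fix-the-smoke distinction above can be sketched as a small classifier on the HTTP status of the smoke login. This is a minimal sketch, not part of the smoke script shown earlier: the function name classify_login_status and the action labels are hypothetical, and reading 401/403 as a credentials problem is an assumption. A deploy that breaks authentication would look the same, so neither outcome should let traffic shift.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the HTTP status of the smoke login to a next action.
# The action labels are illustrative, not a standard.
classify_login_status() {
  local status="$1"
  case "$status" in
    2??)     echo "pass" ;;
    401|403) echo "check-smoke-credentials" ;;  # assumption: the smoke user's password expired or was rotated
    *)       echo "rollback" ;;                 # 5xx, other 4xx, or 000 (no HTTP response): fail safe
  esac
}

# Offline demo with canned status codes
for s in 200 401 503; do
  printf '%s -> %s\n' "$s" "$(classify_login_status "$s")"
done
```

In a real script the status could be captured with curl's -w '%{http_code}' write-out option, which prints 000 when no HTTP response was received at all.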
How To Use It (In Your Capstone)
- Write /healthz (liveness) and /readyz (readiness) endpoints in the app; /readyz should actually talk to the DB and secret store.
- Create a dedicated smoke user (deactivate-able) per environment, with a password stored only in the secret store.
- Commit the smoke script under scripts/smoke.sh. Run it as the last deploy step, with a hard timeout.
- Keep total smoke time under 30 seconds. Anything slower belongs in post-deploy monitoring, not the deploy gate.
- Review the smoke script any time a new major endpoint ships -- the critical path may have moved; the old smoke is now incomplete.
- For canary/blue-green deploys, run smoke against the new revision before shifting traffic.
- Log the smoke result in the release note (concept 14) so the change log answers "was this deploy verified?"
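The hard timeout and the 30-second budget from the list above could be enforced by a wrapper like the one below. This is a sketch under assumptions: run_with_budget is a hypothetical name, and it relies on the timeout command from GNU coreutils, which exits with status 124 when the time limit is hit.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command under a hard wall-clock budget.
# Requires `timeout` from GNU coreutils; it exits 124 when the limit is exceeded.
run_with_budget() {
  local budget_s="$1"; shift
  local start=$SECONDS status=0
  timeout "$budget_s" "$@" || status=$?
  local elapsed=$(( SECONDS - start ))
  if (( status == 124 )); then
    echo "[fail] smoke exceeded ${budget_s}s budget" >&2
  else
    echo "[info] smoke finished in ${elapsed}s with exit ${status}" >&2
  fi
  # Pass the original exit status through so the pipeline step still fails on a bad smoke
  return "$status"
}
```

In the deploy step this might be called as run_with_budget 30 ./scripts/smoke.sh "$BASE_URL". The summary line goes to stderr, so it can be copied into the release note without mixing with the script's own output.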
Smoke-as-Continuous-Probe
The same smoke script can double as a continuous probe run every 1-5 minutes from an external location (a cheap cron on an outside box or a scheduled job). A plain Cloud Monitoring uptime check or Amazon Route 53 health check issues a single HTTP request, so it can cover /healthz or /readyz but not the multi-step login path. In-pipeline smoke catches bad deploys at deploy time; continuous smoke catches drift or external-dependency failures between deploys. Use both, and make sure they share the exact same smoke user and critical path so findings are comparable.
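For the cron-on-an-outside-box variant, a thin wrapper can run the smoke command once and record a timestamped result, so probe findings line up with in-pipeline results. The sketch below is illustrative only: probe_once and the log line format are assumptions, not something the capstone prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical probe wrapper: run the smoke command once, keep its output,
# and append one timestamped result line to a log file.
probe_once() {
  local log_file="$1"; shift
  local status=0
  "$@" >> "$log_file" 2>&1 || status=$?
  local result="PASS"
  (( status == 0 )) || result="FAIL"
  printf '%s smoke=%s exit=%d\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$result" "$status" >> "$log_file"
  return "$status"
}
```

A crontab entry running every five minutes would call a script containing something like probe_once /var/log/smoke-probe.log ./scripts/smoke.sh "$BASE_URL", where the log path is a placeholder. Because the wrapper returns the smoke's exit status, whatever schedules it can still alert on failure.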
See also (integrative)
- S9 M04 Cluster 5: Observability in delivery -- deploy-verification signals
- S9 M05 Cluster 4: Metrics -- cardinality, exemplars, USE/RED -- the post-deploy metrics that complement a one-shot smoke
- S9 M05 Cluster 5: Alerting on symptoms, not causes -- smoke tests symptoms at deploy time
- S8 M04 Cluster 3: SLIs, SLOs, error budgets -- the smoke's critical path often is the primary SLI
- S8 M04 Cluster 5: Observability -- three pillars -- smoke evidence lives with logs and traces in the incident story
- Google SRE Book: Testing for Reliability -- production probes and stress tests as first-class reliability tools
- Google SRE Workbook: Implementing SLOs -- the smoke's path is often an SLO path
- AWS Well-Architected: Operational Excellence -- post-deploy verification expectations
- Kubernetes: liveness, readiness, startup probes -- canonical semantics of the two endpoint kinds
Check Yourself
- Which three checks does your smoke test run, and which one has failed most recently?
- Which smoke failure would trigger an automatic rollback? Which would not (false positive)?
- Where does the smoke user's password live, and who can rotate it?
- What is the smoke's hard timeout, and when was it last exceeded?
- Does /readyz actually hit the DB, or does it just return 200? How did you verify?
- For a canary, does smoke run against the canary URL before traffic shift, or after?
Mini Drill or Application (Capstone-scoped)
- Three-check smoke (45 min). Add /healthz and /readyz to the app, create a smoke user, commit scripts/smoke.sh, wire it as the last deploy step.
- Red-on-purpose. Deliberately break one endpoint (misspell a DB column in /readyz) in a branch, push, and watch the pipeline fail at the smoke step. Confirm the failure exits with a nonzero code and prints a legible message.
- Smoke audit. Review the smoke after a new endpoint ships. Does the critical path still reflect a real user flow? If not, update the smoke and note the change in CHANGELOG.md.
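For the red-on-purpose drill, it can help to rehearse the failure offline before pushing the broken branch. The sketch below pulls the readiness check into a function and feeds it canned response bodies; check_ready_body is a hypothetical name, and the failing body is an invented example of what /readyz might return when the DB query breaks.

```shell
#!/usr/bin/env bash
# Hypothetical extraction of the smoke script's readiness check, so the
# pass and fail paths can be exercised with canned bodies and no network.
check_ready_body() {
  local body="$1"
  if grep -q '"db":"ok"' <<<"$body"; then
    echo "[ok] readiness with db"
  else
    echo "[fail] db not ok: $body" >&2
    return 2
  fi
}

# Healthy body: prints the ok line and returns 0
check_ready_body '{"db":"ok","cache":"ok"}'

# Broken body (invented example): message goes to stderr, status is 2
check_ready_body '{"db":"connection refused","cache":"ok"}' \
  || echo "exit code: $?"
```

The failure message includes the response body, which is what makes it legible in a pipeline log: the reader sees what the DB check reported without re-running anything.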
Source Backbone
Capstone deployment applies cloud, delivery, and operations material. These books are the source backbone for the delivery decisions.
- Building Secure and Reliable Systems -- secure/reliable deployment posture.
- GitHub Actions in Action -- workflow automation support.
- Pro Git -- release history, tags, and branch discipline.
- The Linux Command Line -- shell and deployment automation support.