
Observability in Delivery: Pipeline Metrics and Deployment Markers

What This Concept Is

Observability is not just for production services. The delivery pipeline itself is a production system that benefits from the same instrumentation.

Two practices matter most:

  • Pipeline metrics. Emit timeseries for each pipeline run: duration per stage, success/fail rate, flaky-test count, queue time on the runner. Treat them the same as service metrics -- dashboards, SLOs, alerts.
  • Deployment markers. Emit an event to your observability tool every time a deploy finishes. That event carries the artifact version, environment, commit SHA, and deployer. It is overlaid on service dashboards so every latency graph, error-rate graph, and trace flow is annotated with "v2.3.0 deployed here."

Together they answer two operational questions instantly: is our pipeline healthy? and did this incident start right after a deploy?
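The second question can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the marker record shape, the `changes_before` helper, the 30-minute window, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def changes_before(markers, incident_start, window=timedelta(minutes=30)):
    """Return markers recorded within `window` before the incident, newest first."""
    recent = [
        m for m in markers
        if timedelta(0) <= incident_start - m["time"] <= window
    ]
    return sorted(recent, key=lambda m: m["time"], reverse=True)


utc = timezone.utc
# Illustrative deploy markers, not real data.
markers = [
    {"time": datetime(2024, 5, 7, 9, 0, tzinfo=utc), "version": "v2.2.9"},
    {"time": datetime(2024, 5, 7, 13, 50, tzinfo=utc), "version": "v2.3.0"},
]
incident_start = datetime(2024, 5, 7, 14, 2, tzinfo=utc)

suspects = changes_before(markers, incident_start)
print([m["version"] for m in suspects])  # ['v2.3.0']
```

A dashboard overlay does the same lookup visually; the point is that the answer depends only on deploy markers having trustworthy timestamps.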

The OpenTelemetry project has published semantic conventions for CI/CD attributes (cicd.pipeline.name, cicd.pipeline.run.id, etc.), so pipeline traces can be queried the same way service traces are -- one vendor-neutral schema across GitHub Actions, GitLab, CircleCI, Buildkite.
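As a rough sketch of what adopting the convention looks like, the function below maps two of GitHub Actions' default environment variables onto the two attribute names mentioned above. The `cicd_attributes` helper and the "unknown" fallback are assumptions for illustration; a real setup would attach these attributes to spans through an OpenTelemetry SDK or exporter.

```python
from typing import Mapping


def cicd_attributes(env: Mapping[str, str]) -> dict[str, str]:
    """Map GitHub Actions default env vars to OTel CI/CD attribute names."""
    return {
        "cicd.pipeline.name": env.get("GITHUB_WORKFLOW", "unknown"),
        "cicd.pipeline.run.id": env.get("GITHUB_RUN_ID", "unknown"),
    }


# Illustrative values; on a runner you would pass os.environ instead.
attrs = cicd_attributes({"GITHUB_WORKFLOW": "deploy", "GITHUB_RUN_ID": "1234567890"})
print(attrs)
```

On another CI provider only the mapping changes; queries written against the attribute names stay the same.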

Why It Matters Here

This concept is supporting -- it does not introduce a new delivery primitive, it closes the feedback loop on everything the other concepts set up:

  • DORA metrics (concept 3) are derived from these signals
  • rollback triggers (concept 9) need precise deploy timestamps to correlate
  • change management (concept 14) needs a machine-readable record per release
  • incident response needs to answer "when did this start? did we just deploy?" in seconds

A delivery system without observability is like a service without logs. You can still run it, but you cannot diagnose it.
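To make "derived from these signals" concrete, here is a small sketch of one DORA metric, lead time for changes, computed from a commit timestamp and the separately recorded deploy timestamp. The `lead_time` helper and the sample dates are hypothetical.

```python
from datetime import datetime, timezone


def lead_time(commit_time, deploy_time):
    """Time from commit to production deploy, as a timedelta."""
    return deploy_time - commit_time


utc = timezone.utc
# Illustrative timestamps: the commit comes from version control,
# the deploy time from the deploy event emitted by the pipeline.
committed = datetime(2024, 5, 1, 10, 0, tzinfo=utc)
deployed = datetime(2024, 5, 15, 16, 30, tzinfo=utc)

print(lead_time(committed, deployed))  # 14 days, 6:30:00
```

Without a recorded deploy event, the second timestamp does not exist anywhere, and the metric cannot be computed.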

Concrete Example: Pipeline Metrics

Emit metrics from the pipeline itself. Example for GitHub Actions using a simple webhook + Prometheus pushgateway:

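- name: Report pipeline metrics
  if: always()
  run: |
    # Assumption: an earlier step ran `echo "STAGE_START=$(date +%s)" >> "$GITHUB_ENV"`.
    # $SECONDS restarts in every `run` step, so it cannot time the whole stage.
    DURATION=$(( $(date +%s) - STAGE_START ))
    # owner/repo contains a slash, which would split the Pushgateway URL path.
    REPO="${GITHUB_REPOSITORY//\//_}"
    cat <<EOF | curl --data-binary @- \
      "http://pushgateway:9091/metrics/job/gha/repo/${REPO}"
    # TYPE pipeline_stage_duration_seconds gauge
    pipeline_stage_duration_seconds{stage="test",status="${{ job.status }}"} ${DURATION}
    # TYPE pipeline_run_total counter
    pipeline_run_total{workflow="${{ github.workflow }}",status="${{ job.status }}"} 1
    EOF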

The cat <<EOF | curl --data-binary @- pattern is the shell here-document technique: the text between the EOF markers is piped to curl as the request body. It is useful across CI systems for shaping and shipping metric or log payloads from shell.

Native alternatives: GitHub's built-in metrics via the REST API (GET /repos/:owner/:repo/actions/runs), Datadog's CI Visibility, GitLab's CI/CD analytics. All solve the same problem -- capture duration and status per run.

Dashboards to build:

  • pipeline success rate (by workflow, 7-day rolling)
  • p95 pipeline duration (by workflow, 7-day rolling)
  • flaky-test rate (tests that passed on retry)
  • queue time (runner wait, not job duration)

Treat them as service SLOs. If the pipeline is failing 10% of runs, that is the same severity as a production service failing 10% of requests -- the team's feedback loop is broken.
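The dashboard figures above reduce to simple arithmetic over exported run records. The sketch below assumes a hypothetical record shape (`status`, `duration_s`, `queue_s`) and uses the nearest-rank percentile method; the sample runs are illustrative, not real data.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs):
    """Compute success rate, p95 duration, and p95 queue time for a set of runs."""
    passed = [r for r in runs if r["status"] == "success"]
    return {
        "success_rate": len(passed) / len(runs),
        "p95_duration_s": percentile([r["duration_s"] for r in runs], 95),
        "p95_queue_s": percentile([r["queue_s"] for r in runs], 95),
    }


runs = [
    {"status": "success", "duration_s": 300, "queue_s": 5},
    {"status": "success", "duration_s": 320, "queue_s": 8},
    {"status": "failure", "duration_s": 610, "queue_s": 40},
    {"status": "success", "duration_s": 290, "queue_s": 6},
    {"status": "success", "duration_s": 350, "queue_s": 12},
]

print(summarize(runs))
# {'success_rate': 0.8, 'p95_duration_s': 610, 'p95_queue_s': 40}
```

With only five samples the p95 is simply the slowest run; a 7-day rolling window over real data would give a more stable figure.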

Concrete Example: Deployment Markers

Annotate every production deploy into the observability tool.

Datadog event:

- name: Emit Datadog deploy event
  run: |
    curl -X POST "https://api.datadoghq.com/api/v1/events" \
      -H "Content-Type: application/json" \
      -H "DD-API-KEY: ${{ secrets.DD_API_KEY }}" \
      -d '{
        "title":"Deployed api v2.3.0 to production",
        "text":"Commit ${{ github.sha }} by ${{ github.actor }}\nRollback: kubectl rollout undo deploy/api",
        "tags":["env:production","service:api","version:v2.3.0","deploy"],
        "alert_type":"info"
      }'

Grafana annotation (for Prometheus users):

- name: Add Grafana annotation
  run: |
    curl -X POST "${{ env.GRAFANA }}/api/annotations" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${{ secrets.GRAFANA_TOKEN }}" \
      -d '{
        "dashboardUID":"api-overview",
        "time": '$(date +%s%3N)',
        "tags":["deploy","production","v2.3.0"],
        "text":"Deployed api v2.3.0 (${{ github.sha }})"
      }'

Effect: every graph of error_rate{service=api}, latency_p99{service=api}, etc. shows a vertical line at every deploy, labeled with the version. When an on-call engineer looks at a spike, the deploy is right there on the same graph.
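If the version is not hard-coded, the annotation body is easier to build in a script than in an inline JSON string. The sketch below constructs the same fields as the Grafana example (dashboardUID, time in epoch milliseconds, tags, text); the `deploy_annotation` helper, the `now_s` parameter, and the short commit value are assumptions for illustration, and the actual HTTP call is left out.

```python
import json
import time


def deploy_annotation(service, version, commit, env, dashboard_uid, now_s=None):
    """Build the JSON body for a Grafana annotation marking one deploy."""
    now_s = time.time() if now_s is None else now_s
    return {
        "dashboardUID": dashboard_uid,
        "time": int(now_s * 1000),  # Grafana expects epoch milliseconds
        "tags": ["deploy", env, version],
        "text": f"Deployed {service} {version} ({commit})",
    }


# Fixed timestamp and placeholder commit so the output is reproducible.
body = deploy_annotation(
    "api", "v2.3.0", "abc1234", "production", "api-overview", now_s=1_700_000_000
)
print(json.dumps(body, indent=2))
```

Passing the version and commit as arguments keeps the marker consistent with whatever the deploy job actually shipped.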

Pipeline Observability as SLO

Treat pipeline metrics like service metrics: define an SLO, alert on breach.

Example SLO pack for a shared team pipeline:

SLO                     Target                  Alert
Pipeline success rate   > 95% over 7 days       page on 30-min burn-rate > 14.4x
p95 pipeline duration   < 10 min over 7 days    ticket if 90th percentile > 15 min
Flake rate              < 0.5% of test runs     ticket; auto-quarantine flake > 1%
Runner queue time       p95 < 30 seconds        ticket if queue > 2 min

When one of these burns, it is a delivery outage: the team cannot ship. Treat it with the same severity as a customer-facing outage -- because upstream, it is.
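Burn rate is the observed failure rate divided by the failure rate the SLO allows. A minimal sketch, assuming run counts for the alert window are already available (the `burn_rate` helper and the sample counts are hypothetical):

```python
def burn_rate(failed_runs, total_runs, slo_target):
    """Observed failure rate divided by the error budget (1 - SLO target)."""
    if total_runs == 0:
        return 0.0  # no runs in the window, nothing is burning
    error_budget = 1.0 - slo_target
    return (failed_runs / total_runs) / error_budget


# 3 failures out of 10 runs against a 95% success-rate SLO.
rate = burn_rate(failed_runs=3, total_runs=10, slo_target=0.95)
print(round(rate, 2))  # 6.0
```

With a 95% target the budget is 5% of runs, so a burn rate of 14.4x corresponds to roughly 72% of runs failing in the window.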

Common Confusion / Misconception

"The CI provider's dashboard is enough." It tells you this pipeline run. It does not answer "how has our pipeline changed over 30 days?" or "which tests are flaky across all services?" For those, export the metrics.

"Deploy markers are cosmetic." Most post-incident reviews start with "what changed recently?" Markers compress that question from ten minutes of cross-referencing deploy logs to a glance at the dashboard. Incident MTTR (one of your DORA metrics, concept 3) drops measurably when markers are consistent.

"We'll just grep git log for the deploy time." Git log has commit times, not deploy times. The same commit can be deployed weeks later (or never, for reverted code). The deploy event is a different fact, and it must be recorded separately.

"Pipeline SLOs are nice-to-have." Pipeline outages directly translate to delivery outages -- your team cannot ship. When the pipeline is red for 4 hours on a Tuesday, that is 4 hours of lead-time regression invisible to production dashboards.

"Flag flips don't need markers." They absolutely do. A flag flip is a deploy of behavior, not artifacts. Concept 8 ends at "emit an observability marker" for exactly this reason -- otherwise the "what changed right before the spike" question becomes unanswerable.

How To Use It

A minimal delivery observability setup:

  1. Emit a deploy event with env, service, version, commit, actor to your observability tool on every successful production deploy.
  2. Annotate the service's key dashboards with those events.
  3. Export pipeline duration and status to a timeseries store.
  4. Build one dashboard per repo/pipeline showing success rate, p95 duration, flake rate, queue time.
  5. Define one SLO: e.g. "pipeline success rate > 95% over 7 days."
  6. Alert when the SLO is at risk, the same way you would for any service.
  7. Adopt OpenTelemetry's CI/CD semantic conventions if you want the dashboards to survive a CI provider migration.

Check Yourself

  1. What question does a deployment marker answer faster than any log search?
  2. Why are CI-provider-native dashboards not enough for pipeline observability?
  3. Name one pipeline SLO you would define and one signal that would put it at risk.
  4. What is the difference between commit time and deploy time, and why does it matter?
  5. Why should flag flips emit deploy markers too?

Mini Drill or Application

For one service:

  • identify the observability tool used (Datadog, Grafana, New Relic, CloudWatch)
  • add a step to the production deploy job that emits a deploy marker with service, env, version, commit, actor
  • overlay the marker on at least one dashboard
  • pull pipeline runs from the last 30 days and compute success rate and p95 duration

Find one pipeline weakness this reveals that you did not see before (e.g. "Fridays are 3x slower," "integration stage times out monthly on the same test").
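One way to look for a pattern such as slow Fridays is to group run durations by weekday. The sketch below assumes run records with a `started` datetime and a `duration_s` field; both names and the sample data are illustrative.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def median_duration_by_weekday(runs):
    """Group run durations by the weekday they started on and take the median."""
    by_day = defaultdict(list)
    for run in runs:
        by_day[DAYS[run["started"].weekday()]].append(run["duration_s"])
    return {day: median(durations) for day, durations in by_day.items()}


# Illustrative runs: 2024-05-03 and 2024-05-10 are Fridays,
# 2024-05-06 and 2024-05-13 are Mondays.
runs = [
    {"started": datetime(2024, 5, 3, 15, 0), "duration_s": 900},
    {"started": datetime(2024, 5, 6, 10, 0), "duration_s": 300},
    {"started": datetime(2024, 5, 10, 16, 0), "duration_s": 940},
    {"started": datetime(2024, 5, 13, 11, 0), "duration_s": 320},
]

result = median_duration_by_weekday(runs)
print(result)
```

The same grouping works for stage name, runner label, or branch, depending on which weakness you suspect.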

Read This Only If Stuck

See also (external)