DORA Metrics and the Four Keys

What This Concept Is

DORA (DevOps Research and Assessment, now part of Google Cloud) spent nearly a decade measuring what actually correlates with software delivery performance. The four headline metrics -- the "four keys" -- are:

Deployment Frequency. How often you ship to production.
Lead Time for Changes. From code commit to running in production.
Change Failure Rate. Percentage of deployments that degrade service or require remediation (hotfix, rollback, patch).
Failed Deployment Recovery Time (formerly MTTR -- Mean Time To Restore). How long from a service-impacting failure to recovery.

The first two measure throughput; the last two measure stability. DORA's central finding is that throughput and stability are not a tradeoff -- the highest-performing teams score well on all four. The conclusion is stated plainly in the annual State of DevOps reports and defended in the book Accelerate (Forsgren, Humble, Kim, 2018), probably the most widely cited reference on this topic in industry.
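To make the definitions concrete, here is a minimal sketch of computing all four from a list of deployment records. The `Deployment` record shape and the `four_keys` function are hypothetical, not part of any DORA tooling. The sketch also assumes that a failure starts at deploy time and that medians are an acceptable summary; your own definitions may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    commit_at: datetime                      # when the change was committed
    deployed_at: datetime                    # when it was running in production
    failed: bool = False                     # degraded service or needed a fix
    recovered_at: Optional[datetime] = None  # when service was restored


def four_keys(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Summarize the four keys over a window of `window_days` days."""
    if not deploys:
        raise ValueError("no deployments in the window")
    failures = [d for d in deploys if d.failed]
    lead_hours = [
        (d.deployed_at - d.commit_at).total_seconds() / 3600 for d in deploys
    ]
    # Assumption: the failure begins at deploy time, so recovery is
    # measured from deployed_at to recovered_at.
    recovery_minutes = [
        (d.recovered_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.recovered_at is not None
    ]
    return {
        # Throughput
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours),
        # Stability
        "change_failure_rate": len(failures) / len(deploys),
        "median_recovery_minutes": median(recovery_minutes) if recovery_minutes else 0.0,
    }


start = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=6), failed=True,
               recovered_at=start + timedelta(hours=6, minutes=30)),
    Deployment(start, start + timedelta(hours=8)),
]
report = four_keys(sample, window_days=2)
print(report)
```

The first two keys in the result are the throughput pair and the last two are the stability pair, matching the grouping above.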

A fifth metric -- Reliability -- was added in 2021 to capture operational quality beyond pure deploy safety (availability, latency, error budgets). Some teams report "four keys + reliability"; others hold a strict four.

Why It Matters Here

These four are the most widely accepted empirical measure of delivery health. They let you:

compare two teams' delivery systems honestly, not by vibes
diagnose where your pipeline is weak (long lead time? high change-fail? slow recovery?)
justify investment in CI/CD to non-technical stakeholders
resist process growth that improves one metric while silently harming another

They also explicitly validate the small-batch, trunk-based, feature-flagged model. Teams that adopt trunk-based development (TBD), short-lived branches, and automated rollback tend to move from the Low/Medium bands into High/Elite, as published in DORA's annual State of DevOps report. The exact thresholds shift from year to year, but since the "Elite" band appeared in 2018 its profile has stayed broadly the same: on-demand deploys, lead time under a day, a low change-failure rate, and recovery in about an hour or less.

Concrete Example

A hypothetical service report (band labels are illustrative; DORA's published thresholds change between report years):

Metric	Value	DORA Band
Deployment Frequency	3 deploys/week	Medium
Lead Time for Changes	2 days	Medium
Change Failure Rate	25%	Low
Failed Deployment Recovery Time	4 hours	Low

Reading this honestly: the team is deploying often enough, but 1 in 4 deploys hurts users and recovery is slow. The intervention is not "deploy less often" (that would make lead time worse). The intervention is probably: add a canary, automate rollback, and require a post-deploy smoke test. That targets change-fail rate and recovery without touching throughput.
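A canary gate with automated rollback can be as small as one decision function. The sketch below is illustrative only: `canary_decision`, its parameters, and the default thresholds are assumptions, not values from DORA or from any particular deployment tool.

```python
def canary_decision(
    baseline_error_rate: float,
    canary_error_rate: float,
    smoke_test_passed: bool,
    max_ratio: float = 2.0,       # assumed: tolerate up to 2x the baseline error rate
    min_absolute: float = 0.01,   # assumed: ignore error rates at or below 1%
) -> str:
    """Return "promote" or "rollback" for a canary deploy."""
    # A failed post-deploy smoke test always rolls back.
    if not smoke_test_passed:
        return "rollback"
    # Avoid rolling back on noise when the canary error rate is tiny.
    if canary_error_rate <= min_absolute:
        return "promote"
    # Roll back when the canary is clearly worse than the baseline.
    if canary_error_rate > baseline_error_rate * max_ratio:
        return "rollback"
    return "promote"


print(canary_decision(0.005, 0.004, smoke_test_passed=True))
print(canary_decision(0.005, 0.05, smoke_test_passed=True))
```

Because the decision is automatic, a bad deploy is caught while it serves only a fraction of traffic, which is what moves both change-fail rate and recovery time without slowing deploys.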

Same team, six months later:

Metric	Before	After	Band shift
Deployment Frequency	3/week	5/day	Medium -> Elite
Lead Time	2 days	3 hours	Medium -> Elite
Change Failure Rate	25%	4%	Low -> Elite
Recovery Time	4 hours	12 min	Low -> Elite

The throughput gains came because stability improved: a canary + automated rollback made smaller deploys safe, which made more frequent deploys socially acceptable.

Common Confusion / Misconception

"Lead Time = time to merge." No. DORA's lead time is commit to production, not commit to merge. Including deploy time is the whole point; otherwise a team that merges fast but deploys weekly looks better than it is.
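A small worked example of the gap, using made-up timestamps for a team that merges within hours but ships once a week:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for a single change.
commit_at = datetime(2024, 3, 4, 10, 0)      # Monday 10:00
merged_at = commit_at + timedelta(hours=3)   # merged the same afternoon
deployed_at = datetime(2024, 3, 8, 16, 0)    # weekly release, Friday 16:00

time_to_merge = merged_at - commit_at        # what the misconception measures
lead_time = deployed_at - commit_at          # what DORA measures

print("time to merge:", time_to_merge)
print("lead time for changes:", lead_time)
```

The merge looks fast (3 hours), but the change did not reach users for more than four days. Only the second number describes the delivery system.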

"MTTR means average incident length." Close but not quite. DORA measures specifically failed-deployment recovery -- time from a deploy-caused incident to recovery. It is a deployment-safety metric, not a general ops metric. Don't conflate it with MTTR as used in site reliability engineering, which is any-incident recovery.
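The difference shows up as soon as incidents carry a "deploy-triggered" tag. In this sketch the `Incident` record and the sample numbers are hypothetical, and a mean is used only to mirror the MTTR name:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    minutes_to_recover: float
    deploy_triggered: bool   # was this incident caused by a deployment?


def mean_recovery_minutes(incidents: list[Incident], deploy_only: bool) -> float:
    """Mean recovery time, optionally restricted to deploy-caused incidents."""
    selected = [
        i.minutes_to_recover
        for i in incidents
        if i.deploy_triggered or not deploy_only
    ]
    return mean(selected)


incidents = [
    Incident(20.0, deploy_triggered=True),
    Incident(40.0, deploy_triggered=True),
    Incident(240.0, deploy_triggered=False),  # e.g. an upstream provider outage
]

failed_deploy_recovery = mean_recovery_minutes(incidents, deploy_only=True)
general_mttr = mean_recovery_minutes(incidents, deploy_only=False)
print(failed_deploy_recovery, general_mttr)
```

The same incident log yields 30 minutes for the DORA metric and 100 minutes for general MTTR; mixing the two hides how safe your deployments actually are.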

"Higher deploy frequency automatically means higher quality." Deployment frequency alone is gameable -- a team can deploy hourly with a 50% failure rate. The four metrics are meaningful together; any one in isolation is misleading.

"DORA works the same for any team." The definitions need calibrating. For a platform team, a "deployment" might be a Terraform apply. For a data team, a pipeline promotion. For a library team, a published release. The principles generalize; the events you count differ.
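One way to keep that calibration explicit is to write the mapping down as data. The team kinds and event names below are hypothetical placeholders for whatever your tooling actually emits:

```python
# Hypothetical mapping from team kind to the event that counts as a deployment.
DEPLOYMENT_EVENT_BY_TEAM = {
    "platform": "terraform_apply",
    "data": "pipeline_promotion",
    "library": "release_published",
}


def count_deployments(team_kind: str, event_types: list[str]) -> int:
    """Count only the events this kind of team has defined as a deployment."""
    wanted = DEPLOYMENT_EVENT_BY_TEAM[team_kind]
    return sum(1 for event_type in event_types if event_type == wanted)


log = ["terraform_plan", "terraform_apply", "terraform_apply", "release_published"]
print(count_deployments("platform", log))
print(count_deployments("library", log))
```

The counting logic is identical for every team; only the definition changes, and it sits in one place where it can be reviewed.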

The DORA Shortcoming You Should Name

DORA is a system-level metric, not a team-level performance review. Its most cited misuse is grading individual engineers, or ranking teams against each other across very different domains (payments vs internal tooling vs ML training). The original authors have repeatedly warned against this. The metrics answer "is our delivery system healthy?" not "who is a good engineer?" Using them for performance reviews rapidly destroys their value because teams start gaming them -- e.g., reclassifying incidents as "not deployment-related" to protect change-fail rate.

Other fair criticisms:

definitions require judgment calls (what counts as a deployment? a change failure?)
hard to compare across organizations with very different risk profiles
the "Elite" tier is heavily skewed toward consumer web products; industrial and regulated software legitimately looks different
SPACE (Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow -- Forsgren et al., 2021) was introduced partly as an antidote to these gaps: DORA for the system, SPACE for a team's experience

How To Use It

Implement the four keys before you "improve" anything:

Define each metric concretely for your system: what event is a deployment, what counts as a change failure, and where the timestamp comes from.
Instrument automatically. Deployments should emit an event (a webhook from the pipeline, or a deployment marker to your observability tool -- see concept 15). Incidents should be tagged with whether they were deploy-triggered.
Publish the trailing 30 days. Do not chase week-over-week noise; a rolling window of 30 to 90 days is long enough to show a trend.
Pick one weakest metric and one intervention. Re-measure after a quarter.
Re-inspect definitions every 6 months -- "what counts as a change failure" drifts silently as tooling changes.
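The instrumentation and windowing steps above can be sketched as follows. The payload fields, the service name, and the function names are assumptions; a real pipeline would send the JSON to whatever endpoint your observability tool accepts.

```python
import json
from datetime import datetime, timedelta, timezone


def deployment_event(service: str, commit_sha: str,
                     commit_at: datetime, deployed_at: datetime) -> str:
    """Build the JSON payload a pipeline step could emit (hypothetical schema)."""
    return json.dumps({
        "type": "deployment",
        "service": service,
        "commit_sha": commit_sha,
        "commit_at": commit_at.isoformat(),
        "deployed_at": deployed_at.isoformat(),
    })


def trailing_window(events: list[dict], now: datetime, days: int = 30) -> list[dict]:
    """Keep only the events deployed within the trailing `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        e for e in events
        if cutoff < datetime.fromisoformat(e["deployed_at"]) <= now
    ]


utc = timezone.utc
events = [
    json.loads(deployment_event("checkout", "abc123",
                                datetime(2024, 6, 20, 9, 0, tzinfo=utc),
                                datetime(2024, 6, 20, 11, 0, tzinfo=utc))),
    json.loads(deployment_event("checkout", "def456",
                                datetime(2024, 5, 1, 9, 0, tzinfo=utc),
                                datetime(2024, 5, 1, 12, 0, tzinfo=utc))),
]
recent = trailing_window(events, now=datetime(2024, 6, 30, tzinfo=utc))
print(len(recent), recent[0]["commit_sha"])
```

Emitting both `commit_at` and `deployed_at` in the same event means lead time can be computed later without joining against the version-control history.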

Check Yourself

Name the four DORA metrics and group each as throughput or stability.
What is the single most common misuse of the DORA metrics, and what damage does it cause?
If your change-fail rate went from 5% to 25% after adopting canary rollouts, what probably changed in how you define or detect a change failure, rather than in the system itself?
Why is SPACE not a replacement for DORA?

Mini Drill or Application

Choose a service you work on or contribute to. In one page, write:

your operational definition of a "deployment" and a "change failure"
the current value of all four metrics (rough estimate is fine)
the one metric you would attack first and the one intervention you would make
the one use you would refuse to put these metrics to, given the shortcoming named above

Bring this to a teammate or mentor. The disagreement over definitions is the lesson.

Source Backbone

CI/CD behavior must be checked against official tool docs, but these books provide the durable release-engineering backbone.

Pro Git - branching, tags, signing, and release history.
GitHub Actions in Action - workflow and automation support.
Software Engineering at Google - engineering-process and reliability context.
