DORA Metrics and the Four Keys
What This Concept Is
DORA (DevOps Research and Assessment, now part of Google Cloud) spent nearly a decade measuring what actually correlates with software delivery performance. The four headline metrics -- the "four keys" -- are:
- Deployment Frequency. How often you ship to production.
- Lead Time for Changes. From code commit to running in production.
- Change Failure Rate. Percentage of deployments that cause a degraded service or require a fix (hotfix, rollback, patch).
- Failed Deployment Recovery Time (formerly Time to Restore Service, often called MTTR -- Mean Time To Restore). How long from a service-impacting failure to recovery.
The first two measure throughput. The last two measure stability. DORA's central finding is that throughput and stability are not a tradeoff -- the highest-performing teams are best on all four. The conclusion is stated plainly in the annual State of DevOps reports and defended in the book Accelerate (Forsgren, Humble, Kim, 2018), one of the most widely cited references on the topic in industry.
A fifth metric -- Reliability -- was added in 2021 to capture operational quality beyond pure deploy safety (availability, latency, error budgets). Some teams report "four keys + reliability"; others hold a strict four.
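The four definitions above can be sketched as a small calculation over deployment records. This is a minimal illustration, not DORA's own tooling: the `Deployment` record, its field names, and the `four_keys` function are hypothetical, and it assumes a failure starts at the moment of the deploy that caused it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    commit_at: datetime                      # when the change was committed
    deployed_at: datetime                    # when it was running in production
    failed: bool = False                     # degraded service or needed a fix?
    recovered_at: Optional[datetime] = None  # when service was restored, if it failed


def four_keys(deployments: list[Deployment], window_days: int) -> dict:
    """Summarise the four keys for deployments observed over `window_days` days."""
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.commit_at for d in deployments]
    # Assumption: the failure clock starts at the deploy that caused it.
    recoveries = [d.recovered_at - d.deployed_at
                  for d in failures if d.recovered_at is not None]
    return {
        # Throughput
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        # Stability
        "change_failure_rate": len(failures) / len(deployments) if deployments else None,
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


# Four sample deployments over a two-day window; one of them failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               recovered_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
report = four_keys(sample, window_days=2)
for name, value in report.items():
    print(f"{name}: {value}")
```

The median is used here rather than the mean so that one unusually slow change or long outage does not dominate the summary; that is a design choice of this sketch, and a team could reasonably report a mean or a percentile instead.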
Why It Matters Here
These four are the most widely accepted empirical measures of delivery health. They let you:
- compare two teams' delivery systems honestly, not by vibes
- diagnose where your pipeline is weak (long lead time? high change-fail? slow recovery?)
- justify investment in CI/CD to non-technical stakeholders
- resist process growth that improves one metric while silently harming another
They also explicitly validate the small-batch, trunk-based, feature-flagged model. Teams that adopt trunk-based development (TBD), short-lived branches, and automated rollback tend to move from the Low/Medium bands into High/Elite, as published in DORA's annual State of DevOps reports. The exact thresholds shift from year to year (and not every report includes an Elite tier), but the top band's profile has stayed recognizably the same: on-demand deploys, lead time of a day or less, a low change-fail rate, and recovery within about an hour.
Concrete Example
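A hypothetical service report (band labels are illustrative; the exact cut-offs depend on which year's report you compare against):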
| Metric | Value | DORA Band |
|---|---|---|
| Deployment Frequency | 3 deploys/week | Medium |
| Lead Time for Changes | 2 days | Medium |
| Change Failure Rate | 25% | Low |
| Failed Deployment Recovery Time | 4 hours | Low |
Reading this honestly: the team is deploying often enough, but 1 in 4 deploys hurts users and recovery is slow. The intervention is not "deploy less often" (that would make lead time worse). The intervention is probably: add a canary, automate rollback, and require a post-deploy smoke test. That targets change-fail rate and recovery without touching throughput.
Same team, six months later:
| Metric | Before | After | Band shift |
|---|---|---|---|
| Deployment Frequency | 3/week | 5/day | Medium -> Elite |
| Lead Time | 2 days | 3 hours | Medium -> Elite |
| Change Failure Rate | 25% | 4% | Low -> Elite |
| Recovery Time | 4 hours | 12 min | Low -> Elite |
The throughput gains came because stability improved: a canary + automated rollback made smaller deploys safe, which made more frequent deploys socially acceptable.
Common Confusion / Misconception
"Lead Time = time to merge." No. DORA's lead time is commit to production, not commit to merge. Including deploy time is the whole point; otherwise a team that merges fast but deploys weekly looks better than it is.
"MTTR means average incident length." Close but not quite. DORA measures specifically failed-deployment recovery -- time from a deploy-caused incident to recovery. It is a deployment-safety metric, not a general ops metric. Don't conflate it with MTTR as used in site reliability engineering, which is any-incident recovery.
"Higher deploy frequency automatically means higher quality." Deployment frequency alone is gameable -- a team can deploy hourly with a 50% failure rate. The four metrics are meaningful together; any one in isolation is misleading.
"DORA works the same for any team." The definitions need calibrating. For a platform team, a "deployment" might be a Terraform apply. For a data team, a pipeline promotion. For a library team, a published release. The principles generalize; the events you count differ.
The DORA Shortcoming You Should Name
The DORA metrics measure a delivery system; they are not a performance review for teams or individuals. Their most cited misuse is grading individual engineers, or ranking teams against each other across very different domains (payments vs internal tooling vs ML training). The original authors have repeatedly warned against this. The metrics answer "is our delivery system healthy?" not "who is a good engineer?" Using them for performance reviews rapidly destroys their value because teams start gaming them -- e.g., reclassifying incidents as "not deployment-related" to protect change-fail rate.
Other fair criticisms:
- definitions require judgment calls (what counts as a deployment? a change failure?)
- hard to compare across organizations with very different risk profiles
- the "Elite" tier is heavily skewed toward consumer web products; industrial and regulated software legitimately looks different
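- the metrics say nothing about how the work feels to the people doing it; SPACE (Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow -- Forsgren et al., 2021) was introduced partly as an antidote: DORA for the system, SPACE for a team's experience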
How To Use It
Implement the four keys before you "improve" anything:
- Define each metric concretely for your system: what event is a deployment, what counts as a change failure, and where the timestamp comes from.
- Instrument automatically. Deployments should emit an event (a webhook from the pipeline, or a deployment marker to your observability tool -- see concept 15). Incidents should be tagged with whether they were deploy-triggered.
- Publish the trailing 30 days. Do not chase week-over-week noise; a trailing window of 30 to 90 days is long enough to smooth it out.
- Pick one weakest metric and one intervention. Re-measure after a quarter.
- Re-inspect definitions every 6 months -- "what counts as a change failure" drifts silently as tooling changes.
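The instrumentation and trailing-window steps above can be sketched as follows. Everything named here is an assumption for illustration: the event schema, the `deployment_event` and `trailing_window` helpers, and the `checkout` service and `abc123` commit are hypothetical. The sketch only builds the JSON body; how it is delivered (webhook, deployment marker, log line) depends on your pipeline and observability tool.

```python
import json
from datetime import datetime, timedelta, timezone


def deployment_event(service: str, commit_sha: str,
                     commit_at: datetime, deployed_at: datetime) -> str:
    """Build the JSON body a pipeline step could emit after each deploy."""
    return json.dumps({
        "type": "deployment",
        "service": service,
        "commit_sha": commit_sha,
        "commit_at": commit_at.isoformat(),      # timestamp source: the commit
        "deployed_at": deployed_at.isoformat(),  # timestamp source: the pipeline
    })


def trailing_window(events: list[dict], now: datetime, days: int = 30) -> list[dict]:
    """Keep only events deployed within the last `days` days up to `now`."""
    cutoff = now - timedelta(days=days)
    return [e for e in events
            if cutoff < datetime.fromisoformat(e["deployed_at"]) <= now]


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
events = [
    json.loads(deployment_event("checkout", "abc123",
                                commit_at=now - timedelta(days=d, hours=3),
                                deployed_at=now - timedelta(days=d)))
    for d in (1, 10, 45)   # deploys 1, 10, and 45 days ago
]
recent = trailing_window(events, now)
print(len(recent))  # 2: the deploy from 45 days ago falls outside the window
```

Recording both timestamps in the event itself means lead time can be recomputed later if the definition of "deployment" is revised, without re-querying the pipeline.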
Check Yourself
- Name the four DORA metrics and group each as throughput or stability.
- What is the single most common misuse of the DORA metrics, and what damage does it cause?
- If your change-fail rate went from 5% to 25% after adopting canary rollouts, what probably changed in your definition, not your system?
- Why is SPACE not a replacement for DORA?
Mini Drill or Application
Choose a service you work on or contribute to. In one page, write:
- your operational definition of a "deployment" and a "change failure"
- the current value of all four metrics (rough estimate is fine)
- the one metric you would attack first and the one intervention you would make
- the one use you would refuse to put these metrics to, and why
Bring this to a teammate or mentor. The disagreement over definitions is the lesson.
See also (external)
- DORA -- homepage and capabilities -- entry point to the research
- DORA -- Quick Check -- self-assessment tool for the four keys
- DORA -- core capabilities model -- what actually drives the metrics
- DORA -- the four keys metric definitions -- precise operational definitions
- Google Cloud -- Accelerate State of DevOps reports -- annual published research
- The SPACE of Developer Productivity (Forsgren et al., ACM Queue) -- complementary team-experience framework
Source Backbone
CI/CD behavior must be checked against official tool docs, but these books provide the durable release-engineering backbone.
- Pro Git - branching, tags, signing, and release history.
- GitHub Actions in Action - workflow and automation support.
- Software Engineering at Google - engineering-process and reliability context.