DORA metrics should bring clarity to software delivery performance. Instead, they often create confusion.
The problem is not the metrics themselves; it is how teams define them. One team's Lead Time starts when a ticket is created and ends when the ticket closes. Another team starts the clock at the first commit and ends at deployment. Change Failure Rate might count only Sev-1 incidents for one team and every post-deployment hotfix in another.
These are not harmless variations. They are different measurements with the same label. Without shared definitions, DORA metrics are misleading and cannot support good decisions.
DORA metrics create clarity only when the definitions behind them are explicit. Without alignment on when measurement begins and ends, teams may report similar-looking metrics that reflect very different delivery behaviors underneath. This erodes trust in data-driven decisions and can lead engineering organizations to misallocate effort toward the wrong problems.
Why DORA Metric Definitions Drift Across Teams
Definitions drift because teams use different tools, workflows, and reporting timelines.
- CI logs capture one view of delivery
- Ticketing systems capture another
Automation of reporting is rarely designed; it is usually just inherited.
Warning Signs Your Definitions Have Drifted
- Teams celebrate improvements in metrics that leadership does not recognize
- "Slow" teams use different tools than "fast" teams
- In most teams, CFR is either near-zero or unusually high
- Metrics shift when dashboards or reporting tools do
Example:
A backend team reports a Lead Time of two days using commit to deploy. A mobile team reports twelve days using ticket created to app store release. Leadership celebrates the backend team because it looks faster and pushes the "slower" mobile team to improve. But the definitions were never aligned: both teams might be equally efficient or equally slow.
Without clarity, metrics deceive.
What the Five DORA Metrics Measure
DORA metrics were developed through research described in the book Accelerate and are maintained by the team at dora.dev. The core DORA Four Keys measure software delivery throughput (speed) and stability. Reliability, introduced later as a fifth metric, focuses on operational performance and a team's ability to meet reliability targets such as SLOs and error budgets.
The five metrics are:
- Deployment Frequency
- Lead Time for Changes
- Change Failure Rate
- Time to Restore Service
- Reliability
The framework originally had four metrics; DORA researchers later introduced a fifth operational metric and, in the 2021 State of DevOps report, reframed it explicitly as "Reliability" to better align with user-facing availability, latency, and performance.
Each metric is powerful when teams align on when the clock starts, what the end state looks like, and what thresholds matter.
Common DORA Metric Definitions and Trade-Offs
Canonical DORA benchmarks assume that Lead Time for Changes runs from code commit to successful deployment in production, and that Deployment Frequency counts production deployments. Many teams, however, use local definitions that better fit their tools and workflows.
Here are common choices and their trade-offs:
| Metric | Start event | End event | Trade-offs |
|---|---|---|---|
| Lead Time for Changes | Ticket created or first commit | Code deployed or ticket closed | Commit to deploy is precise and aligns with DORA benchmarks but ignores upstream delays. Ticket to deploy reflects the full business flow from idea to user value but is harder to automate and depends on many more upstream inputs. |
| Deployment Frequency | Merge to main or release created | Deployed to production | Counting merges inflates the rate if merges are not deployed immediately. Counting only production deployments aligns with DORA but can hide work that is released behind feature flags. |
| Change Failure Rate | Each production deploy | Incident within a release window | Counting only Sev-1 incidents misses many failures. Counting every incident or hotfix may overstate risk in noisy systems. |
| Time to Restore Service | Alert triggered | Service back within SLO or customer issue resolved | Measuring until the postmortem closes lags behind user recovery. SLO-based restore times better reflect what users experience. |
| Reliability | SLO window start | SLO window end | Simple uptime can look good even if latency or partial failures hurt users. Error-budget-style reliability captures availability, latency, and partial impact together. |
Lead Time for Changes: Commit to Deploy vs Ticket to Deploy
There are two common approaches to measuring Lead Time for Changes.
- Commit to deploy: Easier to automate using Git logs and CI/CD data. This reflects engineering speed and is the definition used by Four Keys and Google Cloud benchmarks.
- Ticket to deploy: Measures how long users wait for a feature or fix. This includes delays like prioritization, grooming, and scope creep before the first commit.
Choose based on what decision you want to inform:
- If you want to focus on engineering flow and pipeline efficiency, use commit to deploy.
- If you want to understand how long customers wait from idea to delivered value, use ticket created to deploy.
Whichever you pick, write it down and keep it consistent across teams.
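To make the trade-off concrete, here is a minimal sketch in Python that computes Lead Time for Changes under both definitions for the same change. The timestamps and field names are hypothetical and not tied to any particular ticketing or CI/CD tool:

```python
from datetime import datetime

# Hypothetical timestamps for a single change; the field names are
# illustrative, not taken from any specific tool's API.
change = {
    "ticket_created": datetime(2024, 3, 1, 9, 0),
    "first_commit":   datetime(2024, 3, 6, 14, 0),
    "deployed":       datetime(2024, 3, 8, 11, 0),
}

# Commit to deploy: reflects engineering flow and pipeline efficiency.
commit_to_deploy = change["deployed"] - change["first_commit"]

# Ticket to deploy: reflects how long users wait from idea to delivered value.
ticket_to_deploy = change["deployed"] - change["ticket_created"]

print(f"Commit to deploy: {commit_to_deploy.days} days")  # 1 day
print(f"Ticket to deploy: {ticket_to_deploy.days} days")  # 7 days
```

Both numbers are valid; they simply answer different questions about the same change.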
Change Failure Rate: Sev-1 Only vs All Hotfixes
Change Failure Rate (CFR) measures the percentage of deployments that cause a user impacting failure and require remediation.
Two common ways to count failures are:
- Sev-1 only: Count only major incidents such as full outages or severe data loss. This matches how many SRE teams track availability and keeps the focus on the most critical failures.
- All hotfixes within a release window: Count any regression fix or rollback that happens shortly after a deployment, even if the impact is partial or limited.
If your CFR is zero, your criteria are probably too narrow or your incident capture is incomplete. If your CFR is very high, reevaluate what you count as a failure and whether noisy, low impact incidents are polluting the metric.
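The gap between the two counting rules is easy to see in a small sketch. The deployment and incident records below are hypothetical, and attributing incidents to a deploy within a 48-hour release window is an illustrative choice, not a DORA requirement:

```python
# Hypothetical deployment log: each entry lists the incidents attributed to
# that deploy within an agreed release window (for example, 48 hours).
deployments = [
    {"id": "d1", "incidents": []},
    {"id": "d2", "incidents": [{"severity": "sev1"}]},
    {"id": "d3", "incidents": [{"severity": "sev3", "hotfix": True}]},
    {"id": "d4", "incidents": [{"severity": "sev2", "hotfix": True}]},
]

def change_failure_rate(deploys, counts_as_failure):
    """Share of deploys with at least one incident that counts as a failure."""
    failed = sum(
        1 for d in deploys if any(counts_as_failure(i) for i in d["incidents"])
    )
    return failed / len(deploys)

# Rule 1: only Sev-1 incidents count as failures.
sev1_only = change_failure_rate(deployments, lambda i: i["severity"] == "sev1")

# Rule 2: any hotfix, plus Sev-1 incidents, counts as a failure.
all_hotfixes = change_failure_rate(
    deployments, lambda i: i.get("hotfix", False) or i["severity"] == "sev1"
)

print(f"CFR (Sev-1 only):   {sev1_only:.0%}")    # 25%
print(f"CFR (all hotfixes): {all_hotfixes:.0%}")  # 75%
```

Same deployments, same incidents, yet the reported CFR differs by a factor of three depending on the counting rule.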
DORA Metrics in the Real World
Different organizations apply DORA metrics in slightly different ways.
- Google SRE measures reliability using SLOs and error budgets, and treats them as the primary tools for managing reliability and balancing it against feature work.
- Thoughtworks uses the Four Key Metrics as system level indicators of delivery performance and warns against relying only on tooling without considering context when comparing teams.
- Vendors and case studies such as Datadog and Codefresh point out that teams must decide which events count as commits, deployments, and failures, and that these choices often vary even within the same organization.
- DORA's Four Keys provides a reference implementation where you can customize which events count as commits, deployments, and incidents for your environment.
The lesson is simple: there is no single universal implementation of these metrics, so you must document how you measure them.
How to Choose the Right Definitions
Start by deciding what question each of the five DORA metrics should answer in your organization, then choose start and end points that match that purpose.
- Deployment Frequency
  - If you want to see how often production changes, count deploys to production per service or system.
  - If you want to see how often customers see change, count user-visible releases (for example, mobile app store releases or feature flag rollouts).
- Lead Time for Changes
  - If you care about engineering speed, measure from first commit (or pull request opened) to deploy in production.
  - If you care about end-to-end responsiveness for users, measure from ticket created to deploy in production.
- Change Failure Rate
  - If you want a strict signal on critical failures, count only deployments that cause Sev-1 or Sev-2 incidents or emergency hotfixes.
  - If you want a broader picture of quality, count any deploy that requires a rollback, hotfix, or incident above an agreed severity.
- Time to Restore Service
  - If you care most about user impact, measure from the first user-visible impact (or alert) to when service is back within SLO.
  - If you care about process efficiency, you can also track from incident declared to incident closed, as long as you define that consistently.
- Reliability
  - If you want a simple, shared target, measure the percentage of time each service meets its SLOs.
  - If you want to manage risk more actively, track both SLO compliance and error budget consumption, and tie them to policies on release freezes and postmortems (a minimal calculation sketch follows this list).
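For the second reliability option, a minimal sketch of SLO compliance and error budget consumption might look like the following; the 99.9% availability target and the request counts are hypothetical:

```python
# Hypothetical availability SLO over a 30-day window.
slo_target = 0.999            # 99.9% of requests should succeed
total_requests = 10_000_000
failed_requests = 7_500

observed_availability = 1 - failed_requests / total_requests
error_budget = (1 - slo_target) * total_requests   # allowed failures: 10,000
budget_consumed = failed_requests / error_budget   # fraction of budget spent

print(f"Observed availability: {observed_availability:.4%}")  # 99.9250%
print(f"SLO met: {observed_availability >= slo_target}")      # True
print(f"Error budget consumed: {budget_consumed:.0%}")        # 75%
```

A service can be inside its SLO and still have burned most of its error budget, which is exactly the kind of signal a release-freeze policy can act on.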
Whichever choices you make, write down:
- The exact start event
- The exact end event
- Any filters, such as which services, request types, or severities are included
Then apply those rules consistently and revisit them when your architecture or ways of working change.
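One lightweight way to write these decisions down is to keep each definition as a small structured record alongside the dashboards that use it. The fields and example values below are illustrative, assuming a team that measures Lead Time from first commit to production deploy:

```python
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    metric: str
    start_event: str                 # the exact event that starts the clock
    end_event: str                   # the exact event that stops the clock
    filters: list[str] = field(default_factory=list)  # services, severities, etc.

# Hypothetical org-wide definition for Lead Time for Changes.
lead_time = MetricDefinition(
    metric="Lead Time for Changes",
    start_event="first commit on the change branch",
    end_event="successful deployment to production",
    filters=["production services only", "exclude infrastructure-as-code repos"],
)
```

Because the definition lives next to the data, anyone reading the dashboard can see exactly what the number means.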
How to Standardize DORA Metrics Across Teams (Rollout Plan)
In many organizations with fewer than 100 engineers, a rollout like this takes roughly 4 to 6 weeks. Larger organizations may need 8 to 12 weeks, especially if they have multiple business units or complex tooling.
Phase 1: Discovery and Alignment
- Inventory current definitions per team. Capture how each team currently measures Lead Time for Changes, Deployment Frequency, Change Failure Rate, Time to Restore Service, and Reliability.
- Form a working group. Include engineering leaders and senior engineers representing the different engineering functions.
- Define standard metrics with clear start and end points. Agree on what counts as a commit, a deployment, a failure, and a restore event. Decide which definitions are required org-wide versus optional at the team level.
- Validate that each metric can be measured. Confirm that you can, in practice, pull the required data from source control, CI/CD, incident systems, and observability tools and link the data together as needed (a minimal linkage check is sketched below).
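A quick way to validate measurability is to check that events exported from different systems can actually be joined. The sketch below assumes hypothetical exports from CI/CD and the incident tracker, keyed by commit SHA and deploy ID; adapt the field names to your own tooling:

```python
# Hypothetical exports: deploy events from CI/CD and incidents from the
# incident tracker. The keys are assumptions, not a standard schema.
deploys = [
    {"deploy_id": "d-101", "commit_sha": "a1b2c3", "deployed_at": "2024-03-08T11:00Z"},
    {"deploy_id": "d-102", "commit_sha": None,     "deployed_at": "2024-03-09T15:30Z"},
]
incidents = [
    {"incident_id": "inc-7", "deploy_id": "d-101", "severity": "sev2"},
    {"incident_id": "inc-8", "deploy_id": None,    "severity": "sev1"},
]

# If these ratios are low, the metric cannot yet be computed reliably.
deploys_with_commit = sum(1 for d in deploys if d["commit_sha"]) / len(deploys)
incidents_with_deploy = sum(1 for i in incidents if i["deploy_id"]) / len(incidents)

print(f"Deploys linkable to a commit:   {deploys_with_commit:.0%}")    # 50%
print(f"Incidents linkable to a deploy: {incidents_with_deploy:.0%}")  # 50%
```

Low linkage rates are a sign to fix instrumentation before publishing the metric.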
Phase 2: Implementation and Monitoring
- Deploy the new definitions and keep them stable for one quarter. Avoid changing definitions mid-quarter so that trends are meaningful.
- Split org-wide dashboards from team-specific dashboards. Use a small, consistent set of definitions for organization-level views, and allow teams to maintain local metrics for coaching and experimentation.
- Review definitions annually or after major changes. Revisit definitions after large migrations, major process changes, or tool replacements.
Consistency is what turns metrics from trivia into insight.
Fixing Definition Drift
If your metrics already differ across teams:
- Do not throw out old data. Instead, label it with the definitions that were in use at the time (see the sketch after this list).
- Standardize forward and, where possible, apply mappings so that you can approximate old metrics under the new definitions.
- Send a short note explaining what changed, why it changed, and how to interpret new numbers versus old ones.
- Add visible definitions to dashboards in footers, legends, or hover notes so people can see exactly what each metric means.
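Labeling can be as simple as attaching the definition in force to every stored data point, so that old and new numbers are never silently mixed. The record shape below is a hypothetical illustration:

```python
# Hypothetical stored metric points, each tagged with the definition that was
# in force when the value was recorded.
metric_points = [
    {"team": "backend", "metric": "lead_time_days", "value": 2.1,
     "definition": "commit-to-deploy (pre-standardization)"},
    {"team": "backend", "metric": "lead_time_days", "value": 4.8,
     "definition": "ticket-to-deploy (org standard v2)"},
]

# Group by definition before trending, so a change in the definition is never
# mistaken for a change in delivery performance.
by_definition = {}
for point in metric_points:
    by_definition.setdefault(point["definition"], []).append(point["value"])
```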
Treat metric drift like technical debt. It stays invisible until it hurts, and the fix almost always starts with better documentation.
Tool Requirements for Flexible Metrics
Your tooling should make it easy to define and evolve metric definitions without rewriting pipelines.
Look for tools that support:
- Multiple input sources for events (Git, CI/CD, incident systems, observability tools)
- Custom start and stop events for each metric
- Incident classification and severity
- SLO driven reliability windows
- Historical segmentation when definitions change
Open standards help here. Four Keys provides an open data pipeline for DORA metrics, and OpenSLO provides a vendor-neutral way to describe SLOs and error budgets. Avoid tools that hard-code definitions, such as tying Lead Time only to commit to merge, if you need a more realistic view of your delivery flow.
Summary
When clearly defined, DORA metrics are powerful. Aligned definitions allow consistent reporting, better decisions, and trust across teams. A few weeks of focused work to standardize definitions pays off through clarity and smarter resource allocation.
Start by documenting how your teams measure metrics today. The gaps will show you where to focus.
FAQs
What are the DORA metrics?
DORA metrics are five DevOps performance metrics that measure software delivery and operations: deployment frequency, lead time for changes, change failure rate, time to restore service, and reliability.
Why do definitions vary?
Definitions vary because teams use different tools and workflows and rarely write down exactly when each metric starts and ends. Without explicit alignment, definitions drift over time.
Should we align definitions?
Yes. If you want to report consistent metrics across teams or at the organization level, you must align definitions. Local variations are useful for internal coaching but should not be mixed into org-wide dashboards.
What is the best Lead Time for Changes definition?
There is no single best definition. Commit to deploy measures engineering speed. Ticket created to deploy measures end to end time to user value. Pick the one that matches the question you are trying to answer.
Can we benchmark against industry standards?
You can benchmark against industry standards only if your definitions match theirs. DORA benchmarks assume commit to production lead time and production deployments, so adjust or map your definitions before comparing.
How long does standardization take?
In many organizations with fewer than 100 engineers, standardization takes about 4 to 6 weeks. Larger organizations often need 8 to 12 weeks, especially if they split by business unit or have many platform teams.