Flaky Tests Are Quietly Skewing Your DORA Metrics

Flaky tests undermine trust in automation and quietly distort delivery metrics. They inflate reruns and retries, stretch CI time, and mask real defects by teaching teams to ignore failed builds. The immediate impact shows up in Pipeline Run Time and Pipeline Success Rate. Left unresolved, flakiness drifts into Lead Time for Changes and Change Failure Rate by slowing merges and letting risky changes pass.

Leaders should measure flake signals directly, quarantine offenders fast, and push coverage down the test pyramid for stable, fast feedback. Google’s engineering blog documents how non-determinism wastes time and erodes signal, and why isolating and fixing flakiness is essential for velocity (see “Flaky tests at Google and how we mitigate them” and “Test flakiness one of the main challenges of automated testing”). Martin Fowler’s guidance in “Eradicating non-determinism in tests” explains the root causes and practical remedies at the code and environment level.

In short: flaky tests make pipelines slower and noisier. They increase retries and reruns, which drives up Pipeline Run Time and reduces Pipeline Success Rate. Over time, that extra friction slows merges and lets risky changes slip through, which quietly worsens Lead Time for Changes and Change Failure Rate.

What is a flaky test and why it distorts DORA metrics

A flaky test passes or fails without a relevant code change. Common causes include time dependencies, concurrency, external services, random seeds, and test order effects, all well documented in both industry and research (see “Test flakiness one of the main challenges of automated testing”). In a trunk-based or frequent-merge flow, flakiness harms three loops.

  1. Developer loop - false failures trigger local retries and speculative commits
  2. PR loop - red builds cause reruns and review stalls, lengthening queue time
  3. Mainline loop - failures on merge pipelines hide real regressions as teams normalize red

Each of these loops adds delay or hides risk, and that shows up directly in DORA metrics like Lead Time for Changes and Change Failure Rate. The DORA program’s research ties reliable automation and fast feedback to better delivery performance on the four key metrics, which flakiness cuts against by increasing uncertainty and delay (see the DORA research overview).
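To make the time-dependency cause concrete, here is a minimal pytest-style sketch; `is_business_hours` is a hypothetical function for illustration, not taken from any of the cited posts. The flaky version asserts against the real clock, while the deterministic version injects a fixed time.

```python
# Hypothetical example of a time-dependent check and its deterministic rewrite.
from datetime import datetime, timezone
from typing import Optional


def is_business_hours(now: Optional[datetime] = None) -> bool:
    """Return True between 09:00 and 17:00 UTC; accepts an injected clock for tests."""
    current = now or datetime.now(timezone.utc)
    return 9 <= current.hour < 17


def test_business_hours_flaky():
    # Flaky: passes or fails depending on when CI happens to run.
    assert is_business_hours()


def test_business_hours_deterministic():
    # Deterministic: the clock is injected, so the result never depends on wall time.
    assert is_business_hours(now=datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc))
    assert not is_business_hours(now=datetime(2024, 1, 2, 22, 0, tzinfo=timezone.utc))
```

The same injection pattern applies to random seeds and external services: pass the source of non-determinism in rather than reaching for a global one.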

How to tell if flaky tests are skewing your DORA metrics

Look for data and mechanism, not anecdotes. The signals below are designed for weekly review across services and repos.

The table below translates flaky test symptoms into their effect on DORA metrics and concrete next steps.

| Metric | Flake symptom | How it skews DORA metrics | Primary diagnostic | First action |
| --- | --- | --- | --- | --- |
| Pipeline Success Rate | Volatile first-pass green rate across days with no correlated code risk | Pushes teams to rerun builds until green, which hides true failure rates | Compare first-attempt green vs final-attempt green | Quarantine top failing tests and require green on first attempt to merge |
| Pipeline Run Time | Frequent retries and reruns extend median and p90 duration | Inflates the CI portion of Lead Time for Changes | Break out time spent in reruns and retries | Stop auto-rerun policies for failed builds without owner triage |
| Lead Time for Changes | PRs that pass only after multiple rebuilds, or rebuilds during code review hours | Extends the time work sits in review and delays merge to main | Track PR rebuild count and time to green per PR | Gate on stable suites at lower layers before running E2E |
| Change Failure Rate | Incidents caused by issues that flaked in CI then passed on rerun | False confidence lets risky changes ship, and CFR creeps up | Post-incident check for flake history on related tests | Backfill unit or contract tests where failures escaped |
| Test Fail Rate | The same tests fail intermittently with no code diffs touching them | Noise trains people to ignore failures and reduces attention to real defects | Streak chart for test IDs with fail/pass/fail oscillation | Fix timeouts, remove sleeps, and isolate external calls |

Flaky tests and their impacts are well understood in the industry. Google engineers describe how flaky suites lead to costly reruns and red fatigue, which in turn drive longer pipelines and lower trust (see “Flaky tests at Google and how we mitigate them”).
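One way to run the first-attempt versus final-attempt diagnostic from the table is a small script over exported CI run records. The record shape below (`pipeline_id`, `attempt`, `passed`) is an assumption about what your CI system can export, not a specific tool’s API.

```python
# Sketch: compare first-attempt green rate with final-attempt green rate.
# Assumes CI runs exported as dicts with a pipeline id, attempt number, and pass/fail flag.
from collections import defaultdict

runs = [
    {"pipeline_id": "a1", "attempt": 1, "passed": False},
    {"pipeline_id": "a1", "attempt": 2, "passed": True},   # rerun until green
    {"pipeline_id": "b2", "attempt": 1, "passed": True},
    {"pipeline_id": "c3", "attempt": 1, "passed": False},
]

by_pipeline = defaultdict(list)
for run in runs:
    by_pipeline[run["pipeline_id"]].append(run)

first_green = 0
final_green = 0
for attempts in by_pipeline.values():
    attempts.sort(key=lambda r: r["attempt"])
    first_green += attempts[0]["passed"]
    final_green += attempts[-1]["passed"]

total = len(by_pipeline)
print(f"first-attempt green rate: {first_green / total:.0%}")
print(f"final-attempt green rate: {final_green / total:.0%}")
# A wide gap between the two rates is the rerun-until-green signature of flakiness.
```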

Where to fix flaky tests first using the test pyramid

Stabilize the base first. Flakiness is more likely when tests touch networks, time, threads, or UI frameworks. Concentrate coverage at the unit and service layers to reduce non-deterministic surfaces, an approach consistent with widely accepted guidance on keeping most checks fast and isolated (see “The Practical Test Pyramid”).

Use a thin set of end-to-end flows and treat any flake there as a stop-the-line event, because a flake at that layer blocks the merge queue and inflates runtime, which Google cautions against in its flakiness posts (see “Test flakiness one of the main challenges of automated testing”).
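As a sketch of what lowering the layer can look like, the test below replaces a network call with a deterministic stub, so the check no longer depends on third-party uptime or latency; `quote_total` and the price client are hypothetical names used only for illustration.

```python
# Sketch: a service-layer test with the external call stubbed out.
from unittest.mock import Mock


def quote_total(price_client, items):
    """Hypothetical service function: sums unit prices fetched from a price service."""
    return sum(price_client.unit_price(item) for item in items)


def test_quote_total_uses_stubbed_prices():
    # The external price service is replaced with a deterministic stub.
    price_client = Mock()
    price_client.unit_price.side_effect = lambda item: {"apple": 2, "pear": 3}[item]

    assert quote_total(price_client, ["apple", "pear", "apple"]) == 7
```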

A simple resolution policy that improves signal

  • Quarantine and ticket: move flaky tests out of blocking lanes, open a ticket for each, and track them explicitly
  • Owner and SLA: assign a maintainer and a target fix window per test
  • Lower the layer: rewrite UI checks as component or API checks when feasible
  • Kill auto reruns: require a human decision for rebuilds to avoid masking real failures
  • Backfill gaps: when a flake hides a defect, add assertions lower in the stack

This mirrors the mitigation path described by Google’s testing team and aligns with standard advice to eradicate timing and environment dependencies in tests (see “Flaky tests at Google and how we mitigate them” and “Eradicating non-determinism in tests”).
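One possible shape for the quarantine step is a custom pytest marker wired up in `conftest.py`; the marker name, the `RUN_QUARANTINED` environment variable, and the lane split are assumptions about your setup rather than a standard mechanism.

```python
# conftest.py sketch: tests marked "quarantine" are skipped in the blocking lane
# but still run in a separate, non-blocking CI job that sets RUN_QUARANTINED=1.
import os

import pytest


def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine(ticket): flaky test tracked by a ticket, excluded from blocking lanes",
    )


def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_QUARANTINED") == "1":
        return  # non-blocking lane: run everything, including quarantined tests
    skip = pytest.mark.skip(reason="quarantined flaky test, see linked ticket")
    for item in items:
        if item.get_closest_marker("quarantine"):
            item.add_marker(skip)
```

A test is then tagged with `@pytest.mark.quarantine(ticket="FLAKY-123")` (a hypothetical ticket ID), which keeps the ticket reference visible next to the test while the non-blocking job continues to run it.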

FAQ

How do flaky tests affect DORA metrics?
Flaky tests slow down pipelines and hide risk, which lengthens Lead Time for Changes, lowers Pipeline Success Rate, and can raise Change Failure Rate. They increase reruns and retries, normalize failed builds, and make it easier for risky changes to slip through. Google’s engineering blog explains these effects and mitigation patterns (see “Flaky tests at Google and how we mitigate them”).

What is a reliable way to detect flakiness early?
The most reliable early signals are differences between first-attempt and final-attempt success, per-test fail/pass oscillation, and sudden jumps in p90 pipeline runtime driven by retries.
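For the per-test oscillation signal, here is a minimal sketch, assuming you can export each test’s chronological pass/fail history keyed by test ID; the test names and histories are illustrative.

```python
# Sketch: count pass/fail flips per test; frequent flips with no related code
# changes are the oscillation signature of flakiness.
history = {
    "checkout/test_payment_timeout": [True, False, True, True, False, True],
    "search/test_ranking": [True, True, True, True, True, True],
}


def flip_count(results):
    """Number of pass<->fail transitions; stable tests flip rarely."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


suspects = {test: flip_count(r) for test, r in history.items() if flip_count(r) >= 2}
print(suspects)  # {'checkout/test_payment_timeout': 4}
```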

Should we ever ignore flaky failures?
No. Quarantine flaky tests so they do not block merges, but always ticket and fix them. Fowler’s guidance is to remove non-determinism rather than accept it as noise (see “Eradicating non-determinism in tests”).

Where should we add tests after a flake masks a bug?
You should backfill assertions at the lowest layer that would have caught the defect. Keep end-to-end checks minimal and stable, consistent with test pyramid practice (see “The Practical Test Pyramid”).