Flaky Tests Are Quietly Skewing Your DORA Metrics

Flaky tests undermine trust in automation and quietly distort delivery metrics. They inflate reruns and retries, stretch CI time, and mask real defects by teaching teams to ignore failed builds. The immediate impact shows up in Pipeline Run Time and Pipeline Success Rate. Left unresolved, flakiness drifts into Lead Time for Changes and Change Failure Rate by slowing merges and letting risky changes pass.

Leaders should measure flake signals directly, quarantine offenders fast, and push coverage down the test pyramid for stable, fast feedback. Google’s engineering blog documents how non-determinism wastes time and erodes signal, and why isolating and fixing flakiness is essential for velocity (see “Flaky tests at Google and how we mitigate them” and “Test flakiness one of the main challenges of automated testing”). Martin Fowler’s guidance in “Eradicating non-determinism in tests” explains the root causes and practical remedies at the code and environment level.

In short: flaky tests make pipelines slower and noisier. They increase retries and reruns, which drives up Pipeline Run Time and reduces Pipeline Success Rate. Over time, that extra friction slows merges and lets risky changes slip through, which quietly worsens Lead Time for Changes and Change Failure Rate.

What is a flaky test and why it distorts DORA metrics

A flaky test passes or fails without a relevant code change. Common causes include time dependencies, concurrency, external services, random seeds, and test order effects, all well documented in both industry and research (see “Test flakiness one of the main challenges of automated testing”). In a trunk-based or frequent-merge flow, flakiness harms three loops.

  1. Developer loop - false failures trigger local retries and speculative commits
  2. PR loop - red builds cause reruns and review stalls, lengthening queue time
  3. Mainline loop - failures on merge pipelines hide real regressions as teams normalize red

Each of these loops adds delay or hides risk, and that shows up directly in DORA metrics like Lead Time for Changes and Change Failure Rate. The DORA program’s research ties reliable automation and fast feedback to better delivery performance on the four key metrics, which flakiness cuts against by increasing uncertainty and delay (see the DORA research overview).
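To make the time-dependency cause concrete, here is a minimal pytest-style sketch; `is_business_hours` is a hypothetical function for illustration, not taken from any of the cited posts. The flaky version asserts against the real clock, while the deterministic version injects a fixed time.

```python
# Hypothetical example of a time-dependent check and its deterministic rewrite.
from datetime import datetime, timezone
from typing import Optional


def is_business_hours(now: Optional[datetime] = None) -> bool:
    """Return True between 09:00 and 17:00 UTC; accepts an injected clock for tests."""
    current = now or datetime.now(timezone.utc)
    return 9 <= current.hour < 17


def test_business_hours_flaky():
    # Flaky: passes or fails depending on when CI happens to run.
    assert is_business_hours()


def test_business_hours_deterministic():
    # Deterministic: the clock is injected, so the result never depends on wall time.
    assert is_business_hours(now=datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc))
    assert not is_business_hours(now=datetime(2024, 1, 2, 22, 0, tzinfo=timezone.utc))
```

The same injection pattern applies to random seeds and external services: pass the source of non-determinism in rather than reaching for a global one.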

How to tell if flaky tests are skewing your DORA metrics

Look for data and mechanism, not anecdotes. The signals below are designed for weekly review across services and repos.

The table below translates flaky test symptoms into their effect on DORA metrics and concrete next steps.

| Metric | Flake symptom | How it skews DORA metrics | Primary diagnostic | First action |
| --- | --- | --- | --- | --- |
| Pipeline Success Rate | Volatile first-pass green rate across days with no correlated code risk | Pushes teams to rerun builds until green, which hides true failure rates | Compare first-attempt green vs final-attempt green | Quarantine top failing tests and require green on first attempt to merge |
| Pipeline Run Time | Frequent retries and reruns extend median and p90 duration | Inflates the CI portion of Lead Time for Changes | Break out time spent in reruns and retries | Stop auto-rerun policies for failed builds without owner triage |
| Lead Time for Changes | PRs that pass only after multiple rebuilds, or rebuilds during code review hours | Extends the time work sits in review and delays merge to main | Track PR rebuild count and time to green per PR | Gate on stable suites at lower layers before running E2E |
| Change Failure Rate | Incidents caused by issues that flaked in CI then passed on rerun | False confidence lets risky changes ship, and CFR creeps up | Post-incident check for flake history on related tests | Backfill unit or contract tests where failures escaped |
| Test Fail Rate | The same tests fail intermittently with no code diffs touching them | Noise trains people to ignore failures and reduces attention to real defects | Streak chart for test IDs with fail/pass/fail oscillation | Fix timeouts, remove sleeps, and isolate external calls |

Flaky tests and their impacts are well understood in the industry. Google engineers describe how flaky suites lead to costly reruns and red fatigue, which in turn drive longer pipelines and lower trust (see “Flaky tests at Google and how we mitigate them”).
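One way to run the first-attempt versus final-attempt diagnostic from the table is a small script over exported CI run records. The record shape below (`pipeline_id`, `attempt`, `passed`) is an assumption about what your CI system can export, not a specific tool’s API.

```python
# Sketch: compare first-attempt green rate with final-attempt green rate.
# Assumes CI runs exported as dicts with a pipeline id, attempt number, and pass/fail flag.
from collections import defaultdict

runs = [
    {"pipeline_id": "a1", "attempt": 1, "passed": False},
    {"pipeline_id": "a1", "attempt": 2, "passed": True},   # rerun until green
    {"pipeline_id": "b2", "attempt": 1, "passed": True},
    {"pipeline_id": "c3", "attempt": 1, "passed": False},
]

by_pipeline = defaultdict(list)
for run in runs:
    by_pipeline[run["pipeline_id"]].append(run)

first_green = 0
final_green = 0
for attempts in by_pipeline.values():
    attempts.sort(key=lambda r: r["attempt"])
    first_green += attempts[0]["passed"]
    final_green += attempts[-1]["passed"]

total = len(by_pipeline)
print(f"first-attempt green rate: {first_green / total:.0%}")
print(f"final-attempt green rate: {final_green / total:.0%}")
# A wide gap between the two rates is the rerun-until-green signature of flakiness.
```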

Where to fix flaky tests first using the test pyramid

Stabilize the base first. Flakiness is more likely when tests touch networks, time, threads, or UI frameworks. Concentrate coverage at the unit and service layers to reduce non-deterministic surfaces, an approach consistent with widely accepted guidance on keeping most checks fast and isolated (see “The Practical Test Pyramid”).

Use a thin set of end-to-end flows and treat any flake there as a stop-the-line event, because a flake at that layer blocks the merge queue and inflates runtime, which Google cautions against in its flakiness posts (see “Test flakiness one of the main challenges of automated testing”).
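As a sketch of what lowering the layer can look like, the test below replaces a network call with a deterministic stub, so the check no longer depends on third-party uptime or latency; `quote_total` and the price client are hypothetical names used only for illustration.

```python
# Sketch: a service-layer test with the external call stubbed out.
from unittest.mock import Mock


def quote_total(price_client, items):
    """Hypothetical service function: sums unit prices fetched from a price service."""
    return sum(price_client.unit_price(item) for item in items)


def test_quote_total_uses_stubbed_prices():
    # The external price service is replaced with a deterministic stub.
    price_client = Mock()
    price_client.unit_price.side_effect = lambda item: {"apple": 2, "pear": 3}[item]

    assert quote_total(price_client, ["apple", "pear", "apple"]) == 7
```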

A simple resolution policy that improves signal

  • Quarantine and ticket: move flaky tests out of blocking lanes, open a ticket for each, and track them explicitly
  • Owner and SLA: assign a maintainer and a target fix window per test
  • Lower the layer: rewrite UI checks as component or API checks when feasible
  • Kill auto reruns: require a human decision for rebuilds to avoid masking real failures
  • Backfill gaps: when a flake hides a defect, add assertions lower in the stack

This mirrors the mitigation path described by Google’s testing team and aligns with standard advice to eradicate timing and environment dependencies in tests (see “Flaky tests at Google and how we mitigate them” and “Eradicating non-determinism in tests”).
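One possible shape for the quarantine step is a custom pytest marker wired up in `conftest.py`; the marker name, the `RUN_QUARANTINED` environment variable, and the lane split are assumptions about your setup rather than a standard mechanism.

```python
# conftest.py sketch: tests marked "quarantine" are skipped in the blocking lane
# but still run in a separate, non-blocking CI job that sets RUN_QUARANTINED=1.
import os

import pytest


def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "quarantine(ticket): flaky test tracked by a ticket, excluded from blocking lanes",
    )


def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_QUARANTINED") == "1":
        return  # non-blocking lane: run everything, including quarantined tests
    skip = pytest.mark.skip(reason="quarantined flaky test, see linked ticket")
    for item in items:
        if item.get_closest_marker("quarantine"):
            item.add_marker(skip)
```

A test is then tagged with `@pytest.mark.quarantine(ticket="FLAKY-123")` (a hypothetical ticket ID), which keeps the ticket reference visible next to the test while the non-blocking job continues to run it.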

FAQ

How do flaky tests affect DORA metrics?
Flaky tests slow down pipelines and hide risk, which lengthens Lead Time for Changes, lowers Pipeline Success Rate, and can raise Change Failure Rate. They increase reruns and retries, normalize failed builds, and make it easier for risky changes to slip through. Google’s engineering blog explains these effects and mitigation patterns (see “Flaky tests at Google and how we mitigate them”).

What is a reliable way to detect flakiness early?
The most reliable early signals are differences between first-attempt and final-attempt success, per-test fail/pass oscillation, and sudden jumps in p90 pipeline runtime driven by retries.
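For the per-test oscillation signal, here is a minimal sketch, assuming you can export each test’s chronological pass/fail history keyed by test ID; the test names and histories are illustrative.

```python
# Sketch: count pass/fail flips per test; frequent flips with no related code
# changes are the oscillation signature of flakiness.
history = {
    "checkout/test_payment_timeout": [True, False, True, True, False, True],
    "search/test_ranking": [True, True, True, True, True, True],
}


def flip_count(results):
    """Number of pass<->fail transitions; stable tests flip rarely."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


suspects = {test: flip_count(r) for test, r in history.items() if flip_count(r) >= 2}
print(suspects)  # {'checkout/test_payment_timeout': 4}
```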

Should we ever ignore flaky failures?
No. Quarantine flaky tests so they do not block merges, but always ticket and fix them. Fowler’s guidance is to remove non-determinism rather than accept it as noise (see “Eradicating non-determinism in tests”).

Where should we add tests after a flake masks a bug?
You should backfill assertions at the lowest layer that would have caught the defect. Keep end-to-end checks minimal and stable, consistent with test pyramid practice (see “The Practical Test Pyramid”).