Change Failure Rate in the Age of AI-Generated Code

AI coding assistants and agentic tools can multiply output. If your review, testing, and incident response capacity stays the same, you can ship more changes that are harder to understand and easier to break. That is when production starts to feel brittle.

Change failure rate is the percentage of production deployments that require immediate intervention, usually a rollback or hotfix. You can measure it by linking each deployment to the incidents and emergency fixes that follow it. In AI-assisted engineering, change failure rate, a lagging indicator, works best when paired with leading indicators like review coverage, pipeline health, and code churn so teams can ship faster without masking failures.

What is change failure rate?

DORA defines change failure rate as the proportion of production deployments that require immediate intervention afterward, often a rollback or a hotfix.

Two details matter in practice:

  • Change failure rate is about outcomes in production, not whether code "looks good" in review.
  • DORA groups change failure rate and deployment rework rate into the broader concept of Instability. Deployment rework rate captures unplanned deployments triggered by incidents. Tracking both helps you distinguish "a bad deploy" from "we keep doing emergency work."

A common formula is:

  • Change failure rate (%) = (failed production deployments / total production deployments) × 100

The math is simple. The work is making sure everyone means the same thing by "deployment" and "failed."
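
As a minimal sketch of that formula, with the raw counts reported alongside the percentage (the record shape here is an assumption; map it from your deployment system of record):

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    id: str
    failed: bool  # required a rollback, hotfix, or other urgent remediation

def change_failure_rate(deployments: list[Deployment]) -> tuple[int, int, float]:
    """Return (failed, total, percentage) so the raw counts travel with the rate."""
    total = len(deployments)
    failed = sum(1 for d in deployments if d.failed)
    pct = 100.0 * failed / total if total else 0.0
    return failed, total, pct

deploys = [Deployment("d1", False), Deployment("d2", True), Deployment("d3", False)]
failed, total, pct = change_failure_rate(deploys)
print(f"CFR: {failed}/{total} = {pct:.1f}%")  # CFR: 1/3 = 33.3%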

How do you calculate change failure rate?

If you want change failure rate to be comparable over time, treat it like an engineering interface. Define inputs, outputs, and edge cases.

  1. Decide the scope

    • DORA recommends applying delivery performance metrics at the application or service level, because context varies across systems.
    • Start with production only. Mixing staging and production hides real risk.
  2. Define what counts as a deployment

    • Pick a single system of record for production deployments (CD tool, release pipeline, or deployment events).
    • Be consistent about rollouts (canary and progressive delivery still count as deployments if users are exposed).
  3. Define what counts as a failure

    • Include rollbacks, hotfix deployments, and incidents that require urgent remediation.
    • Write down your time window. Many teams use a fixed window after deployment (for example 24 to 72 hours) so they are measuring "blast radius after change," not unrelated backlog bugs.
  4. Link deployments to incidents and emergency fixes

    • The cleanest approach is to join deployments to incidents using a deployment identifier. Google's Four Keys guidance describes measuring change failure rate by counting deployments and linking them to incidents that reference the deployment ID. The sketch after this list shows one way to express that join.
  5. Sanity check construct validity before you optimize

    • If you cannot link incidents to deployments, teams often use surrogate measures such as high-severity bugs per release or per deploy. Surrogates can be useful, but they are easy to misread and easy to game. Kaner and Bond walk through why metrics need validation and how measurement can distort behavior when stakes are high.
    • A related trap is the McNamara fallacy: dropping important factors because they are harder to measure. Kaestner's measurement notes summarize this failure mode in the context of software engineering.
  6. Report the numerator and denominator

    • A percentage with a tiny denominator is unstable. Publish both counts so teams can see when a swing is just "two bad deploys out of ten."
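
As a minimal sketch of that join, assuming your incident records carry a deployment_id field and your documented window is 72 hours (both are assumptions to adapt):

```python
from datetime import datetime, timedelta

# Hypothetical record shapes; substitute your CD tool's and incident tracker's fields.
deployments = [
    {"id": "deploy-101", "service": "checkout", "at": datetime(2024, 5, 1, 9, 0)},
    {"id": "deploy-102", "service": "checkout", "at": datetime(2024, 5, 2, 14, 0)},
]
incidents = [
    {"id": "inc-7", "deployment_id": "deploy-102",
     "opened_at": datetime(2024, 5, 2, 15, 30)},
]

WINDOW = timedelta(hours=72)  # your documented attribution window

def failed_deployments(deployments, incidents):
    """A deployment 'fails' if an incident references its ID within the window."""
    failed = set()
    by_id = {d["id"]: d for d in deployments}
    for inc in incidents:
        dep = by_id.get(inc["deployment_id"])
        if dep and timedelta(0) <= inc["opened_at"] - dep["at"] <= WINDOW:
            failed.add(dep["id"])
    return failed

failed = failed_deployments(deployments, incidents)
print(f"CFR: {len(failed)}/{len(deployments)} = "
      f"{100 * len(failed) / len(deployments):.0f}%")  # CFR: 1/2 = 50%
```

Publishing the numerator and denominator, as step 6 recommends, is then just a matter of printing both counts with the percentage.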

How does AI-generated code affect change failure rate?

AI changes the shape of risk more than the definition of change failure rate.

Research on AI-generated code consistently finds correctness and security concerns, including functional bugs and vulnerabilities across many settings. A Georgetown CSET report breaks the risk into three buckets: insecure code generation, attacks on the models and tools themselves, and downstream security impacts such as feedback loops.

What that looks like in delivery data often has a pattern:

  • More code per change. Agents can produce large diffs quickly, which raises review load and makes it easier to miss logic errors.
  • More logic and dependency mistakes. A 2026 analysis shared on the Stack Overflow blog reports higher rates of logic and correctness issues in AI-created pull requests than in human-created ones.
  • More variance. The team can ship faster on routine work, then lose time when a risky change slips through and causes an incident.

The takeaway is simple: failure modes can shift toward errors that look plausible at a glance and take longer to debug. If guardrails do not scale with output, change failure rate can rise.

Which metrics should you track alongside change failure rate?

Change failure rate is a lagging outcome. To manage it, pair it with leading indicators that show whether quality controls are scaling with output.

For each metric, here is the signal it captures and why it matters for AI-assisted delivery:

  • Change Failure Rate. Signal: how often production deployments require urgent remediation. Why it matters: it is the outcome metric; use it to validate whether AI adoption is improving delivery or creating more production work.
  • Deployment Rework Rate. Signal: how often you do unplanned deployments due to incidents. Why it matters: it separates "a bad deploy" from "a steady stream of emergency work" when AI increases throughput.
  • Mean Time to Restore (MTTR). Signal: how quickly you recover when a deployment fails. Why it matters: if AI increases change volume, fast recovery limits blast radius when a failure happens.
  • Deployment Frequency. Signal: how often you deploy to production. Why it matters: it contextualizes the denominator; if deployment frequency jumps, CFR can move even if the absolute number of failures stays flat.
  • Pipeline Success Rate. Signal: how reliable your CI and tests are. Why it matters: flaky pipelines create pressure to bypass checks, which raises risk when changes are generated faster than they can be validated.
  • Review Latency. Signal: how long pull requests wait for meaningful review. Why it matters: when AI increases PR volume, review queues grow, and long queues correlate with rushed approvals and larger batch merges.
  • No-Review PR Dev Day Ratio. Signal: how much work merges without review. Why it matters: it is a direct guardrail against "ship it" behavior when agents make code generation cheap.
  • Code Churn. Signal: how much code is rewritten shortly after it is added. Why it matters: high churn can mean unclear requirements, shallow reviews, or fragile AI-generated code that needs repeated fixes.
  • Large Branch Rate. Signal: how often you merge very large pull requests. Why it matters: large diffs are hard to review and increase the chance that logic and configuration errors reach production.

If you track these together, you can usually explain a CFR change without guessing. For example, a CFR spike that follows rising Large Branch Rate and falling Pipeline Success Rate points to quality controls breaking under load.
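
As one illustration of how the guardrail metrics fall out of pull request data, here is a minimal sketch that computes median Review Latency, a simplified no-review merge share (a stand-in for the fuller No-Review PR Dev Day Ratio), and Large Branch Rate. The record shape and the 500-line threshold are assumptions:

```python
from datetime import datetime
from statistics import median

LARGE_PR_LINES = 500  # placeholder threshold for a "large" branch

# Hypothetical PR records; map these from your Git host's API.
prs = [
    {"opened": datetime(2024, 5, 1, 9), "first_review": datetime(2024, 5, 1, 15),
     "reviewed": True, "lines_changed": 120},
    {"opened": datetime(2024, 5, 1, 10), "first_review": None,
     "reviewed": False, "lines_changed": 900},
]

latencies = [
    (p["first_review"] - p["opened"]).total_seconds() / 3600
    for p in prs if p["first_review"]
]
review_latency_h = median(latencies)  # guard for an empty list on real data
no_review_share = sum(not p["reviewed"] for p in prs) / len(prs)
large_branch_rate = sum(p["lines_changed"] > LARGE_PR_LINES for p in prs) / len(prs)

print(f"median review latency: {review_latency_h:.1f}h")   # 6.0h
print(f"no-review merge share: {no_review_share:.0%}")     # 50%
print(f"large branch rate: {large_branch_rate:.0%}")       # 50%
```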

How do you reduce change failure rate without slowing delivery?

Start with changes that reduce risk per deploy and increase feedback speed.

  1. Reduce batch size on purpose

    • DORA calls out smaller changes as a common lever because they are easier to reason about and easier to recover from when something fails.
    • Operationalize this with Large Branch Rate and review policies that keep PRs reviewable.
  2. Require evidence in code review

    • For AI-assisted changes, reviewers should look for tests that demonstrate behavior and for notes about assumptions.
    • If you are seeing high No-Review PR Dev Day Ratio or long Review Latency, fix that first. A perfect incident workflow cannot compensate for bypassed review.
  3. Make your pipeline the default safety net

    • Invest in CI reliability so developers trust it. Low Pipeline Success Rate is an anti-signal for AI adoption because it encourages skipping checks.
    • Add automated gates that catch common AI failure modes: dependency changes, configuration drift, and unsafe patterns. The goal is fast feedback, not bureaucracy. A sketch of one such gate follows this list.
  4. Use risk tiers for where AI can operate autonomously

    • The CSET report highlights that code generation can produce insecure code and that evaluation is complex, so treat critical paths as higher risk by default.
    • For high-risk components, require smaller changes, stronger test coverage, and explicit sign-off.
  5. Improve recovery as a first-class capability

    • Change failure rate will never be zero. Build fast rollback, feature flagging, and clear runbooks so you can reduce Failed Deployment Recovery Time when a bad change slips through.
    • Track whether your incident fixes show up as Deployment Rework Rate. If they do, prioritize the root causes behind the most common failure modes.
  6. Run AI adoption as an experiment

    • Compare trends before and after adoption per service, and keep the definition of "failure" stable.
    • Avoid comparing teams with different incident thresholds or different deployment processes. DORA explicitly warns that context differences make broad comparisons misleading.
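
To make the automated gate in step 3 concrete, here is a minimal sketch of a PR-size check that could run in CI. The 400-line budget, the risky-path list, and the default base ref are assumptions to tune per repository; the script only needs git and Python:

```python
#!/usr/bin/env python3
"""Minimal PR-size gate: fail CI when a diff exceeds a line budget."""
import subprocess
import sys

MAX_CHANGED_LINES = 400  # hypothetical budget; tune per service
RISKY_PATHS = ("requirements.txt", "package-lock.json", "Dockerfile")

def changed_lines(base_ref: str) -> list[tuple[int, int, str]]:
    # --numstat prints "added<TAB>deleted<TAB>path" per changed file
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base_ref}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        added, deleted, path = line.split("\t")
        # binary files show "-" for counts; treat them as 0 lines
        rows.append((int(added) if added != "-" else 0,
                     int(deleted) if deleted != "-" else 0, path))
    return rows

def main() -> int:
    base = sys.argv[1] if len(sys.argv) > 1 else "origin/main"
    rows = changed_lines(base)
    total = sum(a + d for a, d, _ in rows)
    risky = [p for _, _, p in rows if p.endswith(RISKY_PATHS)]
    if total > MAX_CHANGED_LINES:
        print(f"FAIL: {total} changed lines exceeds budget of {MAX_CHANGED_LINES}")
        return 1
    if risky:
        # Don't block, but surface dependency/config changes for reviewers
        print("WARN: review dependency/config changes:", ", ".join(risky))
    print(f"OK: {total} changed lines across {len(rows)} files")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A team might run this as a required check and pair it with an explicit override path for reviewed exceptions, so the gate stays fast feedback rather than bureaucracy.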

Common pitfalls when interpreting change failure rate

These mistakes can make change failure rate look better while reliability gets worse.

  • Turning CFR into a target. When a measure becomes a target, it stops being a good measure. Teams can "improve" CFR by reclassifying incidents or delaying declarations.
  • Changing the denominator without noticing. A big shift in Deployment Frequency changes CFR math. Always look at raw counts.
  • Using a proxy without validating it. Counting bugs as failures is sometimes necessary, but bug counts are surrogate measures with known validity problems. Kaner and Bond describe how single metrics can distort decisions if they are not tied to the attribute you actually care about.
  • Blaming AI for everything. AI can change failure modes, but production failures still come from system design, test coverage, operability, and review practices.

minware connects delivery data across code, pull requests, pipelines, and work items so teams can see how quality controls and production outcomes move together.

A practical workflow is:

  • Monitor Change Failure Rate, Deployment Rework Rate, and Failed Deployment Recovery Time as the outcomes.
  • Use Review Latency, No-Review PR Dev Day Ratio, and Large Branch Rate as early warnings that review capacity is falling behind output.
  • Use Pipeline Success Rate and Code Churn to spot fragile changes before they show up as incidents.

The goal is to have enough visibility to ask better questions in retrospectives, then validate whether the changes you make actually reduce production pain.

FAQ about change failure rate and AI-generated code

What is a good change failure rate?

A useful target is one your team can sustain without heroics while meeting your reliability goals. Start by establishing a baseline per service, then focus on trend improvements and on reducing the highest-impact failures.

Should we count every bug as a change failure?

Usually no. Change failure rate is most useful when it tracks failures that require urgent remediation, such as rollbacks, hotfixes, and user-impacting incidents. If you include minor issues, CFR becomes noise and will drift with triage habits.

How do we attribute a failure to a specific deployment?

Pick a consistent attribution window after deployment and document it. If you have deployment IDs in incident records, you can make the linkage explicit. If you do not, treat attribution as an approximation and validate it against a few real incidents.
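
As one hedged illustration of that approximation, the sketch below attributes an incident to the most recent prior deployment of the same service within the window. Field names and the 48-hour window are assumptions:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=48)  # your documented attribution window

def attribute(incident_at: datetime, service: str, deployments: list[dict]):
    """Return the most recent deployment of `service` within WINDOW before the
    incident, or None. Treat the result as an approximation, not ground truth."""
    candidates = [
        d for d in deployments
        if d["service"] == service
        and timedelta(0) <= incident_at - d["at"] <= WINDOW
    ]
    return max(candidates, key=lambda d: d["at"], default=None)

deploys = [
    {"id": "deploy-201", "service": "billing", "at": datetime(2024, 6, 3, 10)},
    {"id": "deploy-202", "service": "billing", "at": datetime(2024, 6, 4, 9)},
]
hit = attribute(datetime(2024, 6, 4, 11), "billing", deploys)
print(hit["id"] if hit else "unattributed")  # deploy-202
```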

Does AI always increase change failure rate?

No. AI can increase output without increasing CFR if review, testing, and operability improve at the same time. The only reliable way to know is to track CFR alongside leading indicators like Pipeline Success Rate, No-Review PR Dev Day Ratio, and Large Branch Rate, and to keep definitions stable across the adoption period.

Change failure rate is still the right metric in the age of AI-generated code. The change is in how you support it: tighter definitions, stronger links between deployments and incidents, and guardrails that scale as code generation gets cheaper.