Using Cycle Time to Measure AI Productivity Gains

AI coding assistants can generate a lot of code quickly. That can make activity metrics like lines of code look great, even when delivery speed and stability do not improve. If you want a measurement that stays tied to customer value, track cycle time and validate it with quality guardrails.

Cycle time is the time a work item spends in active progress, from when someone starts it to when it is done. In software delivery, you can measure cycle time for tickets, pull requests, or changes to production. A real AI productivity gain shows up as lower cycle time for comparable work, without higher rework, incident load, or bug creation.

What is cycle time in software development

Cycle Time is the time it takes to complete a work item once active work starts. It is usually measured from an in progress state to done, and it excludes time the work item spent waiting in a backlog before anyone touched it.

Lead Time for Changes is broader. It includes waiting and queueing time, which matters for customer experience. In DevOps, DORA’s change lead time focuses on the path from code committed to deployed in production, which is useful when you want an end-to-end delivery view.

In practice, you can choose one primary cycle time lens for AI evaluation:

  • Ticket cycle time: from ticket in progress to done
  • PR cycle time: from PR opened to merged, often tracked as PR Lead Time for Changes
  • Change cycle time: from first commit to deploy, aligned with DORA change lead time

Pick one primary definition per team or service, write it down, and measure it the same way during the experiment.
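Whichever definition you pick, it reduces to subtracting two timestamps under a fixed rule. A minimal sketch of the PR lens (opened to merged), using hypothetical ISO-style timestamps:

```python
from datetime import datetime

def pr_cycle_time_hours(opened_at: str, merged_at: str) -> float:
    """PR cycle time under one fixed definition: PR opened -> PR merged."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(merged_at, fmt) - datetime.strptime(opened_at, fmt)
    return delta.total_seconds() / 3600

# A PR opened Monday morning and merged Tuesday afternoon
print(pr_cycle_time_hours("2024-03-04T09:00:00", "2024-03-05T15:30:00"))  # 30.5
```

The point is not the arithmetic but the commitment: the same two events start and stop the clock for every item in the experiment.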

Why cycle time works for measuring AI productivity gains

AI can change the shape of work more than it changes the amount of work. It can reduce time spent writing boilerplate, but increase time spent reviewing, fixing edge cases, or untangling large diffs. Cycle time captures the full journey of a work item through the system, including the handoffs and delays that dominate delivery time in many teams.

Cycle time also makes it easier to separate tooling effects from planning noise. If your backlog is volatile, or work is constantly reprioritized, lead time can swing even when engineering execution is steady. Cycle time helps isolate how quickly teams can finish what they start.

There is also a simple flow relationship worth remembering, known as Little's Law: when work in progress rises and throughput does not, cycle time rises. This is one reason WIP limits work in practice. That matters for AI because teams often start more parallel work when coding feels faster, which can slow delivery even if individuals type less.
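The flow relationship is just average cycle time = average WIP / average throughput. A toy illustration with made-up numbers:

```python
def avg_cycle_time_days(wip: float, throughput_per_day: float) -> float:
    """Little's Law: average cycle time = average WIP / average throughput."""
    return wip / throughput_per_day

# Same team, same throughput of 2 items finished per day:
print(avg_cycle_time_days(6, 2))   # 3.0 days with 6 items in flight
print(avg_cycle_time_days(12, 2))  # 6.0 days after WIP doubles
```

Doubling the number of items in flight without raising throughput doubles how long each item takes, which is exactly the failure mode when faster drafting tempts teams to start more work.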

Evidence is mixed, so measurement matters

Some controlled studies show large speedups for specific tasks. A GitHub Copilot experiment reported developers completing an implementation task significantly faster with Copilot in a controlled setting.

Field experiments show smaller but meaningful effects, and also highlight measurement challenges. One large-scale field analysis reported increases in pull requests completed per week after Copilot access, with caveats about compliance and statistical power.

Other rigorous work shows the opposite in some environments. METR ran a randomized controlled trial with experienced open-source developers working in mature codebases they already knew well and found AI tools increased task completion time in that setting. METR later published an update describing adjustments and evidence that outcomes can shift based on who is measured and how usage changes over time.

That is why you should measure cycle time in your environment instead of assuming a universal benefit.

How to measure cycle time for AI coding tools

You do not need a perfect experiment design to get value, but you do need consistency. The goal is to compare like with like.

Step 1: Define what starts and ends the clock

Write down your operational definition.

Common choices:

  • Start: ticket moves to in progress, or first commit on a branch, or PR opened
  • End: merged to main, or deployed to production, or ticket moved to done

If your team merges quickly but deploys slowly, measuring only PR cycle time will miss the bottleneck. For that case, measuring change lead time alongside PR cycle time is usually more honest.
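To see how the two lenses diverge, compute both from the same change. This sketch uses hypothetical event timestamps for a change that merges fast but deploys slowly:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M"

def hours_between(start: str, end: str) -> float:
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 3600

# Hypothetical change: merged within the day, deployed two days later
events = {
    "pr_opened": "2024-03-04T10:00",
    "merged":    "2024-03-04T16:00",
    "deployed":  "2024-03-06T16:00",
}

pr_cycle = hours_between(events["pr_opened"], events["merged"])       # 6.0 hours
change_lead = hours_between(events["pr_opened"], events["deployed"])  # 54.0 hours
print(pr_cycle, change_lead)
```

PR cycle time alone would call this change fast; the change-lead view shows the deploy queue is where the time actually goes.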

Step 2: Measure the distribution, not only the average

Averages hide long tails. AI can reduce the median but increase the 90th percentile if it creates more complex reviews or more flaky CI runs.

Use percentiles:

  • P50 to represent typical work
  • P75 or P85 to represent how often work drags
  • P95 to show the tail that drives stakeholder pain
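A minimal nearest-rank percentile sketch over invented cycle times shows why the tail matters: the median looks fine while P95 tells a different story.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value covering p percent of the sample."""
    ordered = sorted(values)
    k = max(0, -(-p * len(ordered) // 100) - 1)  # ceil(p/100 * n) - 1
    return ordered[int(k)]

# Hypothetical cycle times in days for 10 completed items
cycle_times = [1, 1, 2, 2, 2, 3, 3, 5, 8, 21]
for p in (50, 85, 95):
    print(f"P{p}: {percentile(cycle_times, p)} days")  # P50: 2, P85: 8, P95: 21
```

A tool that nudged the median from 2 to 1.5 days while pushing that 21-day outlier to 30 would look like a win on averages and a loss to every stakeholder waiting on the tail.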

Step 3: Segment work so you are comparing comparable items

If your AI pilot coincides with a rewrite, a major incident, or a quarter of platform work, raw before and after comparisons will mislead you.

Useful segments:

  • Work type: feature, bug, refactor, operational work
  • Size: exclude outliers using Large Ticket Rate and Large Branch Rate
  • Repo or service: one workflow per service is often cleaner
  • Team maturity: teams with high WIP and long review queues behave differently
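Segmentation is just a group-by over completed items before you compare periods. A stdlib sketch with hypothetical work items:

```python
from collections import defaultdict
from statistics import median

# Hypothetical completed items: (work_type, cycle_time_days)
items = [
    ("feature", 4), ("feature", 6), ("bug", 1),
    ("bug", 2), ("refactor", 9), ("feature", 5),
]

by_type = defaultdict(list)
for work_type, days in items:
    by_type[work_type].append(days)

for work_type, days in sorted(by_type.items()):
    print(f"{work_type}: median {median(days)} days over {len(days)} items")
```

Comparing the feature-work median before and after the pilot is far more defensible than comparing a quarter heavy on bugs against a quarter heavy on refactoring.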

Step 4: Break cycle time into stages to see where AI is helping

Cycle time is a result. The levers are in the stages.

If you track PR Time per Status and Ticket Time per Status, you can see whether AI is reducing coding time, shifting time into review, or creating more time in CI.
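Stage breakdown means differencing consecutive status-transition timestamps. A sketch with a hypothetical status history for one PR (stage names are illustrative, not a specific tool's schema):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M"

# Hypothetical status history: (status entered, timestamp)
history = [
    ("coding",    "2024-03-04T09:00"),
    ("in_review", "2024-03-04T15:00"),
    ("in_ci",     "2024-03-05T11:00"),
    ("merged",    "2024-03-05T12:00"),
]

stage_hours = {}
for (status, start), (_, end) in zip(history, history[1:]):
    hours = (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 3600
    stage_hours[status] = hours
    print(f"{status}: {hours:.1f}h")
```

Here coding took 6 hours but review held the change for 20, which is the kind of shift AI often causes: faster drafting, slower validation.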

A cycle time scorecard for AI pilots

Use cycle time as the primary outcome metric. Pair it with guardrails that protect quality and prevent metric gaming. The table below is a practical set that fits most teams.

| Metric | What it detects | What a real AI gain looks like | What to check if it worsens |
|---|---|---|---|
| Cycle Time (ticket or PR) | Delivery speed for completed work items | Median and tail both trend down for comparable work | WIP, batching, review queues, pipeline delays |
| PR Lead Time for Changes | Time from PR opened to merged | Less waiting in review and fewer stalled PRs | PR Time per Status, reviewer load, PR size via Large Branch Rate |
| Work in Progress (WIP) | Context switching and flow overload | Stable or lower WIP with faster completion | Teams starting more items because coding feels faster |
| Post PR Review Dev Day Ratio | Rework after review | Stable ratios or a small decline as drafts improve | AI-generated changes not matching repo conventions, unclear requirements |
| Never Merged Ratio | Abandoned work and stalled branches | No increase, ideally a decrease in discarded effort | More experimentation without clear acceptance criteria |
| Pipeline Success Rate | Flaky tests and integration friction | Stable or improving success rate | AI introducing brittle tests or mismatched configs |
| Pipeline Run Time | CI bottlenecks | Stable runtime while throughput increases | More frequent runs, heavier test suites, infra limits |
| Change Failure Rate | Production instability after changes | No increase as cycle time drops | Review quality, missing tests, rushed merges |
| New Bugs Per Dev Day | Bug creation rate independent of deploy size | Stable or down | Over-reliance on AI suggestions, weak validation |

Guardrails that keep cycle time honest

Cycle time can improve for the wrong reasons.

Common failure modes:

  • Work is artificially split into many small PRs that move fast individually but create coordination overhead
  • Reviews get skipped to reduce waiting time, increasing defects later
  • Work shifts from tickets into untracked channels, so cycle time looks better on paper

That is why guardrails like No-Review PR Dev Day Ratio, Direct Main Commit Dev Day Ratio, Change Failure Rate, and Pipeline Success Rate matter. They tell you whether speed is coming from healthier flow or weaker controls.

Common confounders when measuring AI impact with cycle time

If you want cycle time to answer the AI question, you have to watch the variables that move cycle time even without AI.

  1. Work mix changes

If your pilot period includes more support work or more refactoring, cycle time will shift. Segment by work type.

  2. WIP increases

If developers start more parallel work because drafting is easier, cycle time can get worse. That is a system effect, and it is a predictable one when WIP is unconstrained.

  3. Review capacity stays fixed

AI can increase the volume of PRs or the size of diffs. If reviewer capacity stays the same, queue time goes up. Look at PR Time per Status to see whether review is the new bottleneck.

  4. CI becomes the limiter

More commits and more PRs can mean more pipeline load. If Pipeline Run Time or Pipeline Success Rate worsens, cycle time will follow.

  5. Novelty effects and expectation bias

Developers can feel faster even when the data says otherwise, which METR highlighted in its RCT setting. That makes outcome metrics like cycle time and quality more reliable than self-reports on speed.

What to do with the results

If cycle time improves and guardrails stay flat, you have evidence the tool is helping in your workflow. You can expand usage with confidence, and then look for the stage that improved to replicate it across teams.

If cycle time improves and quality worsens, you have a process problem. The tool may still be useful, but you need stronger review practices, better test coverage, or clearer definitions of done before expanding.

If cycle time does not improve, do not assume the tool failed. Look at stage breakdowns. AI often shifts time from writing to reviewing and validation, which can still be valuable if it reduces cognitive load or improves maintainability. That is a separate question, and it deserves separate measurement.

How minware helps teams quantify AI productivity with cycle time

AI evaluation gets messy when metrics live in separate tools. minware’s approach is to connect repos, tickets, pipelines, and incidents into one model so you can see how work flows through the system.

For AI rollouts, that makes it easier to:

  • Track PR Lead Time for Changes and isolate bottlenecks with PR Time per Status
  • Measure Work in Progress (WIP) to catch flow overload early
  • Separate outliers with Large Ticket Rate and Large Branch Rate
  • Watch rework signals like Post PR Review Dev Day Ratio and Never Merged Ratio
  • Keep quality guardrails visible with Change Failure Rate, New Bugs Per Dev Day, and Pipeline Success Rate
  • Validate that faster merges still lead to stable delivery by pairing cycle time with deployment and incident signals

The goal is not to prove AI is good or bad in the abstract. The goal is to see what it does to your workflow, in your repos, with your quality benchmarks.

FAQ

What is a good cycle time target for an engineering team?

There is no universal target. A better approach is to set a baseline for each team and then aim to reduce the tail of the distribution. Many teams suffer because the slowest 10 to 20 percent of work items create most of the planning noise.

Should we measure cycle time at the individual developer level?

Usually no. Cycle time is shaped by system constraints such as review capacity, pipeline health, and work intake. Measuring individuals increases the risk of gaming and can push the organization toward activity metrics. Use team and service views for operational decisions, and reserve individual views for private coaching.

How is cycle time different from DORA lead time for changes?

Cycle time usually starts when active work begins. DORA change lead time measures from commit to deployed in production, which is closer to an end-to-end delivery view once code exists in version control. Many teams track both to separate engineering execution from release and deployment constraints.

What if AI makes us ship faster but introduces more bugs?

That is not a productivity gain. It is moving effort downstream into incidents and rework. Keep Change Failure Rate and New Bugs Per Dev Day on the same dashboard as cycle time so you can see whether the system is getting healthier.

Do we need an A/B test to measure AI impact?

A randomized experiment is ideal, but many teams can get useful results with a clean before and after comparison if definitions stay stable, work is segmented, and guardrails are tracked. The key is to treat measurement as a way to learn, not a way to declare victory.

If you want to know whether AI is helping, stop counting output volume. Track cycle time, break it into stages, and keep quality guardrails in the same view. That keeps the conversation tied to flow and customer impact, even as the tools change.