How AI Coding Tools Exposed Broken Dev Productivity Metrics
AI coding tools changed how developers work much faster than most measurement systems changed how they measure. It is now trivial to generate or refactor large amounts of code from a short prompt. Measuring lines of code has been a questionable practice for the last 40 years, yet people still ask about it. At this point, any simple metric built around “activity” will lead the industry even further astray. Dashboards showcasing lines of code, commit counts, or tickets closed will move up and to the right while customer outcomes, reliability, and delivery speed barely improve. AI tools didn’t break these metrics; they revealed how broken those measures have always been.
If leaders keep reading those activity numbers as “productivity,” they will:
- Overestimate the impact of AI tools
- Misallocate headcount and budget
- Reward teams for visible motion instead of real progress
Modern developer productivity metrics need to separate AI-assisted activity from real improvements in delivery speed, quality, and reliability.
This guide assumes you already track basic delivery metrics. It focuses on:
- Which traditional metrics broke under AI coding tools
- Which outcome-oriented metrics still work
- How to measure AI impact using data you likely already have
Why did AI coding tools break traditional developer productivity metrics?
These legacy developer productivity metrics focused on activity, not outcomes:
- Lines of code produced
- Commits or pull requests per engineer
- Tickets or story points completed per sprint
They were fragile and misleading even before AI. Now they are even worse.
Lines of code
AI assistants can generate large blocks of code from a short prompt. That inflates lines of code without telling you whether the code is useful, maintainable, or even correct. And to be fair, lines of code has always been a less-than-useful metric.
Measurement researchers have warned for decades that metrics must be tightly linked to the underlying attribute they claim to measure, or they will be gamed and misinterpreted. LOC has very weak construct validity for “productivity,” and AI tools widened that gap.
Commits and pull requests per developer
Commit and PR counts track how often someone pushes code, not whether the work is valuable.
AI tools change commit patterns in both directions: some developers batch large AI‑generated changes into a single PR, while others iterate through many small AI‑driven experiments.
In either case, a higher commit count does not reliably mean better throughput or healthier systems.
Tickets and story points closed
Ticket throughput looks encouraging when AI speeds up boilerplate work. Teams slice work smaller, close more tickets, and velocity charts go up.
But ticket metrics still:
- Don’t distinguish between high‑leverage features and low‑value chores
- Ignore follow‑on defect work and rework
- Can show “more done” while lead times for meaningful changes and customer satisfaction don’t budge
Time on task and AI usage
New AI‑adjacent metrics are also mostly activity signals:
- Time in IDE
- Number of AI prompts
- Percentage of AI‑generated code
These can be interesting diagnostics but say little about productivity on their own.
A recent randomized trial of experienced developers working on familiar open‑source codebases found that tasks actually took about nineteen percent longer with AI tools, even though participants believed AI would make them roughly twenty percent faster. Activity and perception diverged sharply from real outcomes.
That pattern matches what many leaders report anecdotally. AI tools change where time goes and how development feels. They do not automatically deliver faster or better outcomes.
Any metric that equates “more visible activity” with “more productivity” is now even more unreliable.
Which developer productivity metrics still work with AI coding tools?
The fundamentals of good measurement did not change. You still want developer productivity metrics that capture:
- Flow of changes from idea to production
- Quality and reliability of what ships
- Sustainable workload and developer experience
DORA’s research on the “four keys” shows that deployment frequency, lead time, change failure rate, and time to restore service predict both organizational performance and team well‑being when defined carefully. Those metrics continue to work in an AI world because they focus on outcomes, not typing speed.
Here is a concise view of which metrics remain useful and how to interpret them alongside AI coding tools.
| Goal | Robust metric | What it tells you | AI-specific interpretation |
|---|---|---|---|
| Delivery speed | Lead Time for Changes | Time from first commit or PR to production deploy. | If coding feels faster but lead time is flat, AI is not improving end-to-end flow. The bottleneck sits in review, testing, or release, not typing. |
| Throughput | Deployment frequency | How often production receives changes. | Higher frequency with steady quality suggests real productivity gains. Higher frequency with rising incident counts signals risky use of AI. |
| Quality | Change Failure Rate, Open Bugs | CFR is the share of releases that cause incidents. Open bugs track unresolved defects. | If CFR and open defect counts climb when AI use goes up, the team is probably over-trusting AI output or skipping depth in review. |
| Flow through review | Review Latency, Pull Request Size | Time to first review and the size of each change. | AI that generates overly large or complex diffs will push PR size and review latency up, even if individual contributors feel faster. |
| Capacity mix | Dev Work Days | How much time engineers spend actively building and integrating code. | Healthy AI adoption usually keeps or increases Dev Work Days while shifting work from boilerplate coding toward design, integration, and review. |
| System stability | Pipeline Success Rate | Share of CI runs that pass without human intervention. | AI-generated code that frequently breaks pipelines will drive this rate down and consume reviewer and SRE time. |
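To make these definitions concrete, here is a minimal sketch of computing three of them from per-deploy records. The record shape (`first_commit_at`, `deployed_at`, `caused_incident`) is an assumption, not a standard schema; map it onto whatever your pipeline and incident tooling actually emit.

```python
# Minimal sketch: Lead Time for Changes, deployment frequency, and
# Change Failure Rate from per-deploy records. The Deployment shape
# below is an assumption, not a standard schema.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deployment:
    first_commit_at: datetime   # first commit in the change
    deployed_at: datetime       # when the change reached production
    caused_incident: bool       # whether this release triggered an incident

def lead_time_for_changes(deploys: list[Deployment]) -> timedelta:
    """Median time from first commit to production deploy."""
    return median(d.deployed_at - d.first_commit_at for d in deploys)

def deployment_frequency(deploys: list[Deployment], window_days: int) -> float:
    """Average production deploys per day over the observation window."""
    return len(deploys) / window_days

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deploys that caused an incident."""
    return sum(d.caused_incident for d in deploys) / len(deploys)
```

Run the same functions over a pre‑AI baseline window and an AI‑heavy window, and compare trends rather than any single number.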
These metrics align well with how experienced leaders are already measuring AI impact. Surveys of over 180 companies collected by Laura Tacho and Gergely Orosz show that most teams still lean on DORA-style metrics, PR cycle time, and developer experience measures, then add AI usage data as another slice rather than inventing entirely new KPIs.
How should leaders measure AI impact on productivity?
Once you have robust developer productivity metrics, treat AI adoption as a controlled process change, not magic. The question becomes: does AI improve these metrics for your context?
1. Establish a clean baseline
Before rolling AI tools out widely, capture several weeks or a quarter of:
- Lead Time for Changes
- Deployment frequency
- Change Failure Rate and Open Bugs
- Review Latency and Pull Request Size
- Pipeline Success Rate and Dev Work Days
Document how each metric is defined.
Kaner and Bond highlight that shifting definitions over time makes trend lines meaningless and encourages people to chase whatever version of a number is easiest to achieve.
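One low‑tech way to keep definitions from drifting is to write them down in a versioned file next to the code that computes them, so any change to a definition shows up in review. A minimal sketch with illustrative field names:

```python
# Minimal sketch: pin baseline metric definitions in a versioned file so
# changes to them are visible in review. Field names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    clock_starts: str   # the event that starts the clock
    clock_stops: str    # the event that stops the clock
    aggregation: str    # how individual measurements roll up

BASELINE_DEFINITIONS = (
    MetricDefinition(
        name="Lead Time for Changes",
        clock_starts="first commit on the branch",
        clock_stops="change deployed to production",
        aggregation="median over a trailing four-week window",
    ),
    MetricDefinition(
        name="Change Failure Rate",
        clock_starts="production deploy",
        clock_stops="incident or rollback linked to that deploy",
        aggregation="failed deploys / total deploys per month",
    ),
)
```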
2. Tag and segment AI-assisted work
You do not need perfect attribution, but you do need at least two cohorts:
- Work where AI coding tools contributed heavily
- Work that was mostly manual
Teams featured in The Pragmatic Engineer’s reporting often segment by repository or team, then compare AI adopters and non‑adopters on the same metrics while controlling for process differences. You can also add a simple “AI assisted” checkbox to pull requests as a low‑friction signal. Some tools, such as Claude Code, list the AI agent as a co-author on commits, which makes this signal easy to extract.
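If your tools do add co-author trailers, a small script can surface that signal. A minimal sketch, assuming the trailer mentions the agent by name; the exact trailer text varies by tool and configuration, so the hint strings below are assumptions:

```python
# Minimal sketch: flag commits whose message lists an AI co-author.
# The hint strings are assumptions; check what your tools actually write
# in commit trailers before relying on this.
import subprocess

AI_COAUTHOR_HINTS = ("claude", "copilot")

def ai_assisted_commits(repo_path: str) -> list[str]:
    """Return SHAs of commits with a Co-authored-by trailer naming an AI agent."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%H%x1f%B%x1e"],
        capture_output=True, text=True, check=True,
    ).stdout
    shas = []
    for record in log.split("\x1e"):
        sha, _, body = record.strip().partition("\x1f")
        if not sha:
            continue
        for line in body.lower().splitlines():
            if line.startswith("co-authored-by:") and any(h in line for h in AI_COAUTHOR_HINTS):
                shas.append(sha)
                break
    return shas
```

Commits flagged this way only capture heavy, tool-attributed assistance; pair the script with the PR checkbox to catch the rest.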
3. Compare trends, not single points
From there, you can ask concrete questions:
- Did Lead Time for Changes drop for AI‑heavy work relative to baseline?
- Did Change Failure Rate rise when AI usage increased?
- Did Review Latency or Pull Request Size spike for AI‑assisted PRs?
- Did Dev Work Days shift from new feature work toward bug fixes and firefighting?
The METR randomized study mentioned earlier found that experienced developers took about nineteen percent longer to complete tasks with AI tools, even though they expected to be faster. That does not mean AI is bad. It does mean that intuition and self‑reported speed are not sufficient. Pulling metrics before and after adoption makes these effects visible.
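A minimal sketch of the cohort comparison, assuming a PR export with hypothetical columns `merged_week`, `is_ai_assisted`, and `lead_time_hours` (substitute whatever your own export contains):

```python
# Minimal sketch: weekly median lead time, split into AI-assisted and
# mostly manual cohorts. Column names are assumptions about your PR export.
import pandas as pd

def lead_time_trend(prs: pd.DataFrame) -> pd.DataFrame:
    """One row per week, one column per cohort, values in hours."""
    return (
        prs.groupby(["merged_week", "is_ai_assisted"])["lead_time_hours"]
        .median()
        .unstack("is_ai_assisted")
        .rename(columns={True: "ai_assisted", False: "manual"})
    )
```

Read the resulting trend across weeks, not a single point, and look at Change Failure Rate and Review Latency for the same cohorts before drawing conclusions.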
4. Pair speed with quality and experience
Never interpret a speed metric in isolation. A useful rule of thumb:
- Any time you celebrate improvements in Lead Time for Changes or deployment frequency, check Change Failure Rate, Open Bugs, and incident trends in the same window.
- Any time you see AI helping developers feel less burdened, check whether Dev Work Days spent on unplanned work is shrinking or just shifting.
The key is treating AI as a tool whose value must show up in the same outcomes you cared about before.
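One way to enforce that pairing is a simple guard in whatever script or report summarizes the quarter: a speed win only counts if quality held in the same window. The thresholds below are illustrative, not recommendations:

```python
# Minimal sketch: only report a lead-time improvement as a win if quality
# metrics from the same window held steady. Thresholds are illustrative.
def speed_win_is_real(
    lead_time_change_pct: float,   # negative means faster than baseline
    cfr_delta_points: float,       # change in Change Failure Rate, in points
    open_bugs_delta: int,          # change in open defect count
) -> bool:
    meaningfully_faster = lead_time_change_pct <= -5.0
    quality_held = cfr_delta_points <= 1.0 and open_bugs_delta <= 0
    return meaningfully_faster and quality_held
```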
Which metrics should you retire or demote?
Retire entirely as performance indicators:
- Lines of code written should have been retired in 1985
- Commits or pull requests per person
- Hours in IDE or number of AI prompts
Demote to supporting context only:
- Tickets closed and story points completed per sprint
- Percent of code written by AI
These can still be interesting, but only when read alongside real developer productivity metrics. They help answer “what changed?” after you observe an outcome shift, not “how are we performing?” on their own.
Keep as headline metrics:
- Lead Time for Changes and deployment frequency for speed
- Change Failure Rate, Open Bugs, and incident rates for quality
- Review Latency, Pull Request Size, Pipeline Success Rate, and Dev Work Days for flow and capacity
This set gives you a stable, AI‑resistant foundation.
AI coding tools did not make measurement impossible. They exposed which developer productivity metrics were weak and which still tell the truth. Leaders who ground their decisions in resilient metrics will see clearly where AI is helping, where it is hurting, and where they need to invest next.
Using tools like minware with AI-aware metrics
Platforms that already integrate repository, pull request, ticket, and pipeline data can help leaders see AI effects without inventing new dashboards.
For example, in minware you can:
- Break down Lead Time for Changes into coding, review, and deployment stages to see where AI is actually moving the needle
- Compare Review Latency, Pull Request Size, and Pipeline Success Rate for AI‑tagged and non‑AI‑tagged work items
- Use Dev Work Days to understand whether AI is reducing toil or just shifting effort to different stages
The important point is not the specific tool, but the discipline of using your existing developer productivity metrics as the lens for evaluating AI coding tools.
FAQ: AI coding tools and developer productivity metrics
Do AI coding tools make DORA metrics obsolete?
No. DORA metrics remain some of the most reliable indicators of delivery performance because they focus on outcomes rather than activity (DORA four key metrics guide). AI may change how you achieve good lead times and low failure rates, but the metrics themselves still describe what “good” looks like.
Should we measure how much code is written by AI?
You can track AI usage as context, but treat it as a descriptive metric. High AI contribution is only positive if delivery and quality metrics improve at the same time. Making “percent AI generated” a target invites gaming and does not guarantee better outcomes.
Can story points and velocity still be used?
Yes, as planning tools. They should not be used as primary productivity metrics, especially with AI involved. Use them to forecast and then judge execution using flow and quality metrics such as Lead Time for Changes and Change Failure Rate.
Should we measure lines of code produced?
No.
Should we ever have measured lines of code produced?
No.
How often should we revisit our developer productivity metrics with AI in the stack?
Review metric definitions at least annually and after major process or tooling changes. Ensure boundaries for Lead Time for Changes, Change Failure Rate, and other key metrics still match how your teams work, and that they are not inadvertently rewarding behaviors that AI can inflate without delivering value.