How AI Coding Tools Exposed Broken Dev Productivity Metrics

AI coding tools changed how developers work far faster than most measurement systems changed how they measure. It is now trivial to generate or refactor large amounts of code from a short prompt. Measuring lines of code has been a questionable practice for the last 40 years, yet people still ask about it. At this point, any simple “activity” metric will lead the industry even further astray. Dashboards showcasing lines of code, commit counts, or tickets closed will move up and to the right while customer outcomes, reliability, and delivery speed barely improve. AI tools didn’t break these metrics; they revealed how broken those measures have always been.

If leaders keep reading those activity numbers as “productivity,” they will:

  • Overestimate the impact of AI tools
  • Misallocate headcount and budget
  • Reward teams for visible motion instead of real progress

Modern developer productivity metrics need to separate AI-assisted activity from real improvements in delivery speed, quality, and reliability.

This guide assumes you already track basic delivery metrics. It focuses on:

  • Which traditional metrics broke under AI coding tools
  • Which outcome-oriented metrics still work
  • How to measure AI impact using data you likely already have

Why did AI coding tools break traditional developer productivity metrics?

These legacy developer productivity metrics focused on activity, not outcomes:

  • Lines of code produced
  • Commits or pull requests per engineer
  • Tickets or story points completed per sprint

They were fragile and misleading even before AI. Now they are even worse.

Lines of code

AI assistants can generate large blocks of code from a short prompt. That inflates lines of code without telling you whether the code is useful, maintainable, or even correct. And to be fair, lines of code has always been a less-than-useful metric.

Measurement researchers have warned for decades that metrics must be tightly linked to the underlying attribute they claim to measure, or they will be gamed and misinterpreted. LOC has very weak construct validity for “productivity,” and AI tools widened that gap.

Commits and pull requests per developer

Commit and PR counts track how often someone pushes code, not whether the work is valuable.

AI tools change commit patterns in both directions. Some developers batch large AI‑generated changes into a single PR, while others iterate through many small AI‑driven experiments.

In either case, a higher commit count does not reliably mean better throughput or healthier systems.

Tickets and story points closed

Ticket throughput looks encouraging when AI speeds up boilerplate work. Teams slice work smaller, close more tickets, and velocity charts go up.

But ticket metrics still:

  • Don’t distinguish between high‑leverage features and low‑value chores
  • Ignore follow‑on defect work and rework
  • Can show “more done” while lead times for meaningful changes and customer satisfaction don’t budge

Time on task and AI usage

New AI‑adjacent metrics are also mostly activity signals:

  • Time in IDE
  • Number of AI prompts
  • Percentage of AI‑generated code

These can be interesting diagnostics but say little about productivity on their own.

A recent randomized trial of experienced developers working on familiar open‑source codebases found that tasks actually took about nineteen percent longer with AI tools, even though participants believed AI would make them roughly twenty percent faster. Activity and perception diverged sharply from real outcomes.

That pattern matches what many leaders report anecdotally. AI tools change where time goes and how development feels. They do not automatically deliver faster or better outcomes.

Any metric that equates “more visible activity” with “more productivity” is now even more unreliable.

Which developer productivity metrics still work with AI coding tools?

The fundamentals of good measurement did not change. You still want developer productivity metrics that capture:

  • Flow of changes from idea to production
  • Quality and reliability of what ships
  • Sustainable workload and developer experience

DORA’s research on the “four keys” shows that deployment frequency, lead time, change failure rate, and time to restore service predict both organizational performance and team well‑being when defined carefully. Those metrics continue to work in an AI world because they focus on outcomes, not typing speed.

Here is a concise view of which metrics remain useful and how to interpret them alongside AI coding tools.

| Goal | Robust metric | What it tells you | AI-specific interpretation |
| --- | --- | --- | --- |
| Delivery speed | Lead Time for Changes | Time from first commit or PR to production deploy. | If coding feels faster but lead time is flat, AI is not improving end-to-end flow. The bottleneck sits in review, testing, or release, not typing. |
| Throughput | Deployment Frequency | How often production receives changes. | Higher frequency with steady quality suggests real productivity gains. Higher frequency with rising incident counts signals risky use of AI. |
| Quality | Change Failure Rate, Open Bugs | CFR is the share of releases that cause incidents. Open bugs track unresolved defects. | If CFR and open defect counts climb when AI use goes up, the team is probably over-trusting AI output or skipping depth in review. |
| Flow through review | Review Latency, Pull Request Size | Time to first review and the size of each change. | AI that generates overly large or complex diffs will push PR size and review latency up, even if individual contributors feel faster. |
| Capacity mix | Dev Work Days | How much time engineers spend actively building and integrating code. | Healthy AI adoption usually keeps or increases Dev Work Days while shifting work from boilerplate coding toward design, integration, and review. |
| System stability | Pipeline Success Rate | Share of CI runs that pass without human intervention. | AI-generated code that frequently breaks pipelines will drive this rate down and consume reviewer and SRE time. |
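
To make these definitions concrete, here is a minimal sketch of how the first two could be computed from delivery records you likely already export. The field names (`merged_at`, `deployed_at`, `caused_incident`) are illustrative placeholders, not a real API; your Git host or analytics platform will expose the equivalents under different names.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Deployment:
    """One production deployment; field names are illustrative placeholders."""
    merged_at: datetime       # when the change was merged (or first committed)
    deployed_at: datetime     # when the change reached production
    caused_incident: bool     # did this release trigger an incident or rollback?

def lead_time_for_changes_hours(deploys: list[Deployment]) -> float:
    """Median hours from merge to production deploy (Lead Time for Changes)."""
    return median((d.deployed_at - d.merged_at).total_seconds() / 3600
                  for d in deploys)

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of production deploys that caused an incident (Change Failure Rate)."""
    return sum(d.caused_incident for d in deploys) / len(deploys)
```

Tracked per team or repository over time, these two numbers cover the delivery-speed and quality rows above regardless of how much of the code was AI-generated.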

These metrics align well with how experienced leaders are already measuring AI impact. Surveys of over 180 companies collected by Laura Tacho and Gergely Orosz show that most teams still lean on DORA-style metrics, PR cycle time, and developer experience measures, then add AI usage data as another slice rather than inventing entirely new KPIs.

How should leaders measure AI impact on productivity?

Once you have robust developer productivity metrics, treat AI adoption as a controlled process change, not magic. The question becomes: does AI improve these metrics for your context?

1. Establish a clean baseline

Before rolling AI tools out widely, capture several weeks or a quarter of:

  • Lead Time for Changes and Deployment Frequency
  • Change Failure Rate and open bug counts
  • Review latency and pull request size

Document how each metric is defined.

Kaner and Bond highlight that shifting definitions over time makes trend lines meaningless and encourages people to chase whatever version of a number is easiest to achieve.
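
One low-friction way to follow that advice is to keep the definitions themselves in version control, so any change to a boundary shows up in review rather than silently shifting the trend line. A minimal sketch; the boundaries shown are examples, not recommendations:

```python
# metric_definitions.py - checked into the repo so definition changes are
# reviewed explicitly, keeping trend lines comparable over time.
METRIC_DEFINITIONS = {
    "lead_time_for_changes": {
        "start": "first commit on the branch",   # or: PR opened
        "end": "change running in production",
        "aggregation": "median, per team, per week",
    },
    "change_failure_rate": {
        "numerator": "deploys causing an incident or rollback within 48 hours",
        "denominator": "all production deploys",
        "aggregation": "per team, per month",
    },
}
```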

2. Tag and segment AI-assisted work

You do not need perfect attribution, but you do need at least two cohorts:

  • Work where AI coding tools contributed heavily
  • Work that was mostly manual

Teams featured in The Pragmatic Engineer’s reporting often segment by repository or team, then compare AI adopters and non‑adopters on the same metrics while controlling for process differences. You can also add a simple “AI assisted” checkbox to pull requests as a low‑friction signal. Some tools, such as Claude Code, list the AI agent as a co-author on commits, which provides easy access to this signal.
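
For instance, if your tooling writes a co-author trailer (Claude Code adds a `Co-Authored-By: Claude ...` line to the commits it creates), a rough cohort split can come straight from `git log`. A sketch under that assumption; the trailer text and time window are placeholders to adjust for your setup:

```python
import subprocess

def count_commits(since: str, grep: str | None = None) -> int:
    """Count commits since a date, optionally filtered by commit-message text."""
    cmd = ["git", "log", f"--since={since}", "--format=%H"]
    if grep:
        # -i makes --grep case-insensitive; --grep matches the full commit
        # message, which includes trailers such as Co-Authored-By.
        cmd += ["-i", f"--grep={grep}"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return len(out.splitlines())

total = count_commits("90 days ago")
ai_assisted = count_commits("90 days ago", grep="co-authored-by: claude")
print(f"{ai_assisted} of {total} commits carry an AI co-author trailer")
```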

From there, you can ask concrete questions:

  • Do AI‑heavy cohorts ship changes with shorter lead times?
  • Do their change failure rates and open bug counts differ?
  • Are review latency and pull request size moving in the wrong direction?

3. Compare metrics before and after adoption

The METR randomized study mentioned earlier found that experienced developers took about nineteen percent longer to complete tasks with AI tools, even though they expected to be faster. That does not mean AI is bad. It does mean that intuition and self‑reported speed are not sufficient. Pulling metrics before and after adoption makes these effects visible.
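
Once changes carry a cohort tag (the PR checkbox or co-author trailer above), the before-and-after comparison is mostly bookkeeping. A minimal sketch, assuming each change record has hypothetical `ai_assisted` and `lead_time_hours` fields:

```python
from statistics import median

def median_lead_time_by_cohort(changes: list[dict]) -> dict[str, float]:
    """Median lead time (hours) for AI-assisted vs. mostly manual changes."""
    cohorts: dict[str, list[float]] = {"ai_assisted": [], "manual": []}
    for change in changes:
        key = "ai_assisted" if change["ai_assisted"] else "manual"
        cohorts[key].append(change["lead_time_hours"])
    return {name: median(values) for name, values in cohorts.items() if values}
```

The same split works for Change Failure Rate, review latency, or pull request size; what matters is that both cohorts, and the before and after periods, use identical definitions.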

4. Pair speed with quality and experience

Never interpret a speed metric in isolation. A useful rule of thumb: count a delivery‑speed improvement as a real productivity gain only if Change Failure Rate, open bug counts, and developer experience hold steady or improve over the same period.

The key is treating AI as a tool whose value must show up in the same outcomes you cared about before.
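
That rule of thumb is simple enough to encode as a guard over your baseline and current numbers. A sketch with hypothetical keys, where lower is better for all three values:

```python
def speed_gain_is_real(baseline: dict, current: dict) -> bool:
    """Count a speed improvement only if quality held steady or improved.

    Hypothetical keys: 'lead_time_hours', 'change_failure_rate', 'open_bugs'.
    """
    faster = current["lead_time_hours"] < baseline["lead_time_hours"]
    quality_held = (
        current["change_failure_rate"] <= baseline["change_failure_rate"]
        and current["open_bugs"] <= baseline["open_bugs"]
    )
    return faster and quality_held
```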

Which metrics should you retire or demote?

Retire entirely as performance indicators:

  • Lines of code written should have been retired in 1985
  • Commits or pull requests per person
  • Hours in IDE or number of AI prompts

Demote to supporting context only:

  • Tickets closed and story points completed per sprint
  • Percent of code written by AI

These can still be interesting, but only when read alongside real developer productivity metrics. They help answer “what changed?” after you observe an outcome shift, not “how are we performing?” on their own.

Keep as headline metrics:

  • Lead Time for Changes
  • Deployment Frequency
  • Change Failure Rate and open bugs
  • Review latency and pull request size
  • Dev Work Days
  • Pipeline Success Rate

This set gives you a stable, AI‑resistant foundation.

AI coding tools did not make measurement impossible. They exposed which developer productivity metrics were weak and which still tell the truth. Leaders who ground their decisions in resilient metrics will see clearly where AI is helping, where it is hurting, and where they need to invest next.

Using tools like minware with AI-aware metrics

Platforms that already integrate repository, pull request, ticket, and pipeline data can help leaders see AI effects without inventing new dashboards.

For example, in minware you can track the metrics above, including Lead Time for Changes, Change Failure Rate, Dev Work Days, and Pipeline Success Rate, and segment them by team or repository to compare AI‑assisted work with mostly manual work.

The important point is not the specific tool, but the discipline of using your existing developer productivity metrics as the lens for evaluating AI coding tools.

FAQ: AI coding tools and developer productivity metrics

Do AI coding tools make DORA metrics obsolete?

No. DORA metrics remain some of the most reliable indicators of delivery performance because they focus on outcomes rather than activity (DORA four key metrics guide). AI may change how you achieve good lead times and low failure rates, but the metrics themselves still describe what “good” looks like.

Should we measure how much code is written by AI?

You can track AI usage as context, but treat it as a descriptive metric. High AI contribution is only positive if delivery and quality metrics improve at the same time. Making “percent AI generated” a target invites gaming and does not guarantee better outcomes.

Can story points and velocity still be used?

Yes, as planning tools. They should not be used as primary productivity metrics, especially with AI involved. Use them to forecast and then judge execution using flow and quality metrics such as Lead Time for Changes and Change Failure Rate.

Should we measure lines of code produced?

No.

Should we ever have measured lines of code produced?

No.

How often should we revisit our developer productivity metrics with AI in the stack?

Review metric definitions at least annually and after major process or tooling changes. Ensure boundaries for Lead Time for Changes, Change Failure Rate, and other key metrics still match how your teams work, and that they are not inadvertently rewarding behaviors that AI can inflate without delivering value.