How to Track Technical Debt Introduced by AI Agents


AI coding agents can write a lot of code fast. Speed is rarely the problem. The problem is accepting code that is expensive to change later. That cost shows up as slower reviews, fragile tests, more rework, and higher defect rates months after the "quick win".

Technical debt is the future cost you take on when today's code choices make tomorrow's changes harder (Martin Fowler's definition of technical debt). To track debt introduced by AI agents, label AI-assisted pull requests and compare a balanced scorecard for AI vs non-AI work: rework after review, revert and bug rates, PR size, and maintainability drift in the modules the agent touched. Treat the metrics as guardrails and tune prompts, templates, and reviews based on the data.

In this post, an AI agent means an LLM-powered tool that can plan changes and submit code, tests, or docs through a pull request.

What is technical debt and how can AI agents introduce it?

Technical debt is a useful metaphor because it forces a tradeoff conversation. You can take a shortcut today, ship sooner, then pay interest later in slower change cycles, higher risk, and more cleanup work. The metaphor traces back to Ward Cunningham's description of making pragmatic choices with future consequences (Ward Cunningham on technical debt).

The easiest way to make the concept actionable is to classify it. The technical debt quadrant is a practical model for separating deliberate debt from accidental debt, and separating reckless work from informed tradeoffs (Technical Debt Quadrant).

AI agents introduce technical debt in familiar ways, plus a few that are easy to miss:

  • Excess code: extra layers, wrappers, or helpers that inhibit understanding the code now and changing it later
  • Duplication: similar logic copied across files because the agent does not recognize existing abstractions
  • Weak tests: tests that mirror implementation details and break on refactors, or tests that pass but miss edge cases
  • Dependency sprawl: new packages added to solve small problems, increasing upgrade and security overhead
  • Inconsistent design: code that works locally but drifts from your architectural conventions

If you want a simple definition for this post, use this: AI-introduced technical debt is any agent-generated change that increases future cost of change more than it increases today's delivery speed.

What should you measure to track AI-introduced technical debt?

You cannot measure technical debt directly. You measure signals that correlate with future change cost. That is why measurement validity matters. Kaner and Bond call out how easy it is to treat a surrogate measure as the thing you care about, then manage to the number instead of the outcome (Kaner and Bond, METRICS 2004).

Start with three buckets:

  • Flow friction: does AI code slow down review and integration?
  • Rework: do changes bounce back for fixes, rewrites, or follow up PRs?
  • Stability and maintainability: do AI changes correlate with bugs, incidents, or maintainability drift?

You will get better answers if you segment the data. Compare AI-assisted PRs to non-AI PRs and look at trends over time. Do not mix them and hope the average tells the story.

How do you label AI agent work so the data stays clean?

Tracking starts with a reliable, machine-readable signal that a change involved an agent. Without that, you end up arguing about anecdotes instead of measuring actual usage.

Use one primary signal and one backup signal:

  1. Source of truth in commit or PR metadata
    Example: a Co-authored-by: trailer, bot-authored commit, or other agent attribution recorded directly on the commit or PR. Because it travels with the change itself, this should be the primary signal.

  2. Human-applied PR label or template checkbox
    Example: a required checkbox like "AI-assisted" or "AI agent authored" that adds a label. This is useful as a secondary signal for review, reporting, or cleanup, but it should not be the system of record.

Then capture a small amount of context so you can act on what you learn:

  • The intent: link the PR to a ticket and a short problem statement
  • The boundaries: what the agent was allowed to change (files, modules, dependencies)
  • The prompt or plan: a short summary or a link to an internal doc

You do not need perfect attribution. You need consistent attribution. A stable process beats a clever heuristic.
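The attribution check can be a few lines of string matching over commit messages. This is a minimal sketch, assuming agents leave a Co-authored-by: trailer or a [bot] suffix; the patterns are illustrative and should be adjusted to the specific agents your team uses:

```python
import re

# Hypothetical agent signatures; replace with the bots and trailers you actually use.
AGENT_PATTERNS = [
    r"Co-authored-by:.*\[bot\]",          # e.g. "example-agent[bot] <...>"
    r"Co-authored-by:.*\bai-agent\b",     # e.g. a named internal agent account
]

def is_ai_assisted(commit_message: str) -> bool:
    """Return True if the commit message carries an agent attribution trailer."""
    return any(re.search(p, commit_message, re.IGNORECASE) for p in AGENT_PATTERNS)
```

Running this over every commit in a PR, and labeling the PR if any commit matches, gives you the machine-readable signal without depending on humans remembering a checkbox.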

Which metrics reveal technical debt earlier than backlog size?

Backlog size is a lagging indicator. By the time the backlog screams, the debt has already changed your daily work. Look for earlier signals in review friction, rework, and stability.

The table below is a practical starting set. It includes minware workflow and quality metrics plus a few codebase signals. Segment every metric by AI-assisted vs non-AI work.

| Metric | What it signals | How to use it for AI debt tracking | Decision cue |
| --- | --- | --- | --- |
| PR Lead Time for Changes | How long code takes to reach main after a PR opens. | Compare AI-assisted vs non-AI PR lead time and trend it. Rising lead time on AI PRs often means review friction or cleanup work. | If AI PR lead time rises while non-AI is flat, tighten guardrails: smaller PRs, better context, and better test requirements. |
| Review Latency | Time from PR open to first meaningful review. | AI can increase PR volume. If latency climbs, reviewers are overloaded and debt slips through. | If latency rises, limit parallel agent PRs and route AI PRs through a review rotation. |
| Work In Progress per Person | How much parallel work people carry. | Agent output can raise parallel PR and ticket load. Rising WIP usually means more context switching and longer queues. | If WIP rises, cap the number of concurrent AI PRs and finish review and merge before starting new agent work. |
| Post PR Review Dev Day Ratio | How much work happens after review starts. | High post review work on AI PRs is a strong signal of mismatched intent, unclear requirements, or low quality output. | If post review work is high, add a design note requirement and require the agent to generate tests and rationale before code. |
| Never Merged Ratio | Time spent on work that never merges. | Agent work that gets abandoned is debt in disguise. It burns time and leaves half-finished branches and decisions. | If never merged work rises, narrow agent scope and require smaller, reviewable increments. |
| Large Branch Rate | Oversized PRs that are hard to review. | Agents tend to generate big diffs. Big diffs hide debt and raise merge risk. | If large PRs trend up, cap PR size for agent work and make the agent split changes by concern. |
| Pipeline Success Rate | How reliable your CI is. | Flaky tests are technical debt. Agents can add brittle tests quickly. | If success rate drops on AI PRs, quarantine flaky tests and require fixes before expanding agent scope. |
| Change Failure Rate | High severity bugs per change. | Use this as a trailing signal. Segment by AI-touched changes to see if agent output correlates with higher production bugs. | If AI-touched changes have a higher failure rate, slow down agent use on critical paths and increase review depth. |
| New Bugs Per Dev Day | How often new bugs are created relative to dev time. | This normalizes for bursts of output. It helps when AI increases throughput. | If bugs per dev day rise after agent adoption, treat it as interest cost and rebalance toward quality work. |
| Rework churn in 14 to 30 days | How much recently merged code gets rewritten soon after. | Measure the percent of lines in an AI PR that change again within 14 to 30 days. High rework churn is classic debt. | If rework churn is high, focus agents on smaller refactors and add stronger acceptance tests. |
| Maintainability drift in touched modules | Complexity and readability trending worse where agents work. | Track cyclomatic complexity deltas, duplication, and maintainability index trends in modules frequently changed by agents. | If maintainability drops, invest in refactoring and add architectural constraints to agent prompts. |
| Security findings per change | Risky patterns that increase future remediation cost. | Run secure coding checks and track findings by PR. Use NIST SSDF and OWASP as guardrails for what matters most. | If findings rise, add secure coding requirements and block risky patterns in CI. |
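The rework churn signal can be computed from line-level change records. This is a minimal sketch, assuming you can extract (file, line) pairs from diffs (for example via git blame or your platform's API); the data shapes here are illustrative, not a specific tool's output:

```python
from datetime import date, timedelta

def rework_churn(merged_lines, later_changes, merge_date, window_days=30):
    """
    Fraction of a merged PR's lines that were modified again within the window.

    merged_lines: set of (file, line) pairs touched by the PR.
    later_changes: list of (change_date, set of (file, line)) for later commits.
    """
    cutoff = merge_date + timedelta(days=window_days)
    reworked = set()
    for change_date, lines in later_changes:
        if merge_date < change_date <= cutoff:
            reworked |= lines & merged_lines
    return len(reworked) / len(merged_lines) if merged_lines else 0.0
```

A churn value near zero means the change stuck; a high value means the "done" PR was really a first draft, which is exactly the interest payment the metric is meant to surface.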

Notes on maintainability and security:

  • Composite metrics can look precise while hiding assumptions. Treat them as signals, not truth. The maintainability index is a common example of an attractive number that still needs skepticism (think twice before using the Maintainability Index).
  • For security baselines, the Secure Software Development Framework is a solid reference for process level expectations (NIST SSDF). For common application risks, use OWASP's Top 10 as a shared vocabulary (OWASP Top 10).

How to build an AI technical debt scorecard

This is a lightweight setup you can do with existing repo, PR, and CI data. Aim for visibility. Perfection is unnecessary.

| Category | Minimum metrics | Pattern to watch |
| --- | --- | --- |
| Review friction | Review Latency, PR Lead Time for Changes | AI-assisted PRs trend slower than non-AI for several weeks. |
| Rework | Post PR Review Dev Day Ratio, rework churn in 14 to 30 days | AI-assisted changes need repeated follow-up PRs to reach "done". |
| Delivery hygiene | Large Branch Rate, Never Merged Ratio | More oversized PRs or abandoned branches after agent adoption. |
| Stability | Change Failure Rate, New Bugs Per Dev Day | Production bugs rise after throughput increases. |
| Build health | Pipeline Success Rate | CI becomes flaky after agents start adding or updating tests. |

Implementation steps:

  1. Pick your scope
    Start with one repo or one team. Include both AI-assisted and non-AI work.

  2. Define your AI label
    Use commit or PR metadata as the primary signal, such as a Co-authored-by: trailer, and a human-applied PR label as the backup.

  3. Baseline the last 8 to 12 weeks
    Capture median values for the scorecard. Keep the distributions, not just averages.

  4. Segment the metrics
    Create two views: AI-labeled PRs and everything else.

  5. Add one outcome link
    Pick one operational outcome you already track, like severity 1 incidents or high priority bug creation. Tie it back to changes.

  6. Set guardrails, not targets
    Use thresholds as prompts for investigation. Avoid tying these to individual performance. Teams game metrics when incentives demand it.

  7. Create a feedback loop
    When you see friction or rework, change one constraint: PR size limits, test requirements, dependency policy, or review workflow. Recheck the trends in two sprints.
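Steps 3 and 4 above, baselining and segmenting, can be sketched with the standard library alone. This assumes you have already exported PRs as simple records with an AI label and a lead time; the field names are illustrative:

```python
from statistics import median, quantiles

def baseline(prs):
    """
    Segment PRs by AI label and keep the distribution, not just the average.
    Each PR is a dict like {"ai": bool, "lead_time_hours": float}.
    """
    out = {}
    for label in (True, False):
        times = sorted(p["lead_time_hours"] for p in prs if p["ai"] == label)
        if not times:
            continue
        out["ai" if label else "non_ai"] = {
            "median": median(times),
            # 75th percentile; falls back to the single value for tiny segments
            "p75": quantiles(times, n=4)[2] if len(times) > 1 else times[0],
            "n": len(times),
        }
    return out
```

Keeping the median and an upper percentile per segment is what lets you see a worsening tail on AI PRs even when the average looks flat.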

If you need a reality check for speed and stability tradeoffs, the DORA research program is a useful baseline for balanced delivery metrics.

Gotchas and limitations when you use metrics to track technical debt

Confounders and incentives break metrics more often than math.

Selection bias is the big one. Teams often start agents on low risk code, scripts, or internal tooling. That can make early metrics look great. When agents move into core code, the pattern changes. Track scope changes explicitly.

Size dominates many signals. AI PRs are often larger. Larger PRs tend to take longer to review and are more likely to hide issues. Compare like with like by adding a size bucket or controlling for diff size.
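Controlling for diff size can be as simple as a bucket function applied before any AI vs non-AI comparison. The thresholds here are illustrative assumptions, not a standard; pick cut points that match your own distribution:

```python
def size_bucket(lines_changed: int) -> str:
    """Bucket a PR by total lines changed so size-matched cohorts can be compared."""
    if lines_changed <= 50:
        return "S"
    if lines_changed <= 250:
        return "M"
    if lines_changed <= 1000:
        return "L"
    return "XL"
```

Comparing AI and non-AI lead time within the same bucket removes most of the "AI PRs are just bigger" confounder.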

Counting debt by static analysis alone misses process debt. Agent adoption also creates prompt debt, evaluation harness debt, and review policy debt. Those show up in flow and rework metrics earlier than in lint results.

Metrics change behavior. If you reward low bug counts or high throughput, people will optimize for the number. Keep these metrics for decision making and coaching instead of treating them like leaderboards.

Suggested charts to spot AI-introduced technical debt early

  1. PR lead time split by AI label over time
    Plot PR Lead Time for Changes for AI-assisted vs non-AI PRs by week. Add PR size as a filter.

  2. Post review rework vs PR size for AI-assisted PRs
    Plot with Post PR Review Dev Day Ratio on one axis and PR size on the other. Outliers tell you where the process is failing.

  3. Bug and revert rate for AI-touched changes
    Trend Change Failure Rate and rollback or revert events for AI-touched merges. This helps separate review annoyance from real production risk.
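Preparing the data for the first chart, lead time split by AI label per week, might look like the sketch below. The record shape is an assumption; any plotting library can consume the resulting series:

```python
from collections import defaultdict
from datetime import date

def weekly_lead_time(prs):
    """
    Average PR lead time grouped by ISO week and AI label,
    ready to plot as two lines. Each PR is a dict like
    {"merged": date, "ai": bool, "lead_time_hours": float}.
    """
    series = defaultdict(list)
    for p in prs:
        week = p["merged"].isocalendar()[:2]  # (ISO year, ISO week)
        series[(week, p["ai"])].append(p["lead_time_hours"])
    return {key: sum(v) / len(v) for key, v in series.items()}
```

Swapping the mean for a median or percentile (as in the baseline step) is a reasonable refinement once the pipeline works end to end.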

FAQ about technical debt and AI agents

Can you measure technical debt directly?

No. Technical debt is a cost concept, not a single observable measure. Track signals that predict future change cost, then validate them against outcomes you care about.

What is the best metric to start with?

Start with rework signals. In most teams, Post PR Review Dev Day Ratio plus Rework Rate will tell you quickly whether agent output is creating cleanup work.

Should we block AI agents from writing production code?

Do not decide that globally. Decide by risk. Use agents on low risk surfaces first, then expand scope only if your scorecard stays stable.

How often should we review the scorecard?

Weekly is enough for flow metrics. Quality and stability trends often need at least a few weeks of data. The key is consistency.

What do we do when the metrics point to rising debt?

Start with constraints that reduce blast radius: smaller PRs, stronger tests, and limits on new dependencies. Then tune prompts and add review checklists so the agent produces work your system can accept.

Tracking technical debt introduced by AI agents is a measurement problem and a feedback loop problem. Label the work, segment the data, and watch for rework and stability signals before the backlog turns into a crisis.