Engineering Metrics That Work In Legacy Codebases

Legacy systems make high‑performing teams look slow and messy on paper. Lead time is longer, incidents are harder to diagnose, and refactors ship in small slices that never show up as shiny new features. If leadership evaluates these teams with the same developer productivity metrics used for greenfield products, the result is chronic underinvestment and pressure to take unsafe shortcuts.

The fix is not to “make the legacy team go faster.” It’s to change what you measure. In brownfield engineering, value shows up as stability, safer changes, and steady modernization.

This guide outlines which engineering metrics still work in legacy codebases, which ones mislead, and how to construct a metric set that rewards disciplined modernization instead of punishing it.

Why legacy code distorts standard productivity metrics

Most organizations call a system “legacy” when it’s business‑critical but built on older technologies or architectures. Michael Feathers popularized a more operational definition that many teams still use: legacy code is code without tests, because you cannot change it safely without extra effort.

Typical legacy traits:

  • Sparse or outdated tests and documentation
  • Tight coupling between modules and databases
  • Shared environments and brittle deployment processes
  • Long tails of obscure edge cases that only a few people understand

In that environment, the inadequacy of traditional activity metrics becomes undeniable. Lines of code, commit counts, and tickets closed never reflected real progress, and in legacy systems the gap is impossible to miss. A legacy team might:

  • Merge fewer pull requests than a greenfield team
  • Ship fewer net‑new features
  • Spend more time on analysis, regression testing, and coordination

…while delivering more business value by reducing incident risk, paying down critical technical debt, or completing migrations that unblock future work.

Measurement research has warned about this for years: metrics must be tied to the quality attributes they claim to represent, or they will drive gaming and local optimization instead of genuine improvement.

For legacy systems, that raises two simple questions: “What does good look like?” and “Which metrics actually show you’re getting there?”

What “good” looks like in a legacy codebase

For brownfield engineering, success usually falls into four buckets:

  1. Stability and reliability
    Fewer high‑severity incidents, faster recovery, less unplanned downtime.

  2. Safe delivery of change
    Ability to ship fixes and incremental improvements without destabilizing the system.

  3. Modernization progress
    Reduction in the surface area of legacy components and dependencies that block new work.

  4. Predictable effort
    Clear expectations for how much time legacy systems consume so roadmap planning is realistic.

For legacy‑heavy teams, these signals matter more than raw feature throughput.

Software engineering metrics that matter for legacy systems

The table below lists metrics that work well in legacy environments and how to read them.

| Category | Metric | Why it works for legacy | What “good” looks like |
| --- | --- | --- | --- |
| Stability | Incident Volume | Tracks how often the legacy system breaks in production. | Incident counts trend down or stay low while change volume stays steady. |
| Stability | Mean Time to Restore (MTTR) | Measures how quickly the team recovers from failures. | Median restoration time falls as runbooks, observability, and automation improve. |
| Quality | Open Bugs by age band | Reveals unresolved risk and how long defects linger in legacy modules. | 90+ day bugs in core legacy areas shrink steadily; new serious bugs are triaged quickly. |
| Quality | Change Failure Rate scoped to legacy components | Shows how safely you can change fragile areas. | CFR gradually approaches your high‑performer range as tests and rollout practices mature. |
| Flow | Lead Time for Changes, legacy vs. new | Separates delivery speed on old and modernized paths. | Legacy lead time remains higher than greenfield, but improves quarter over quarter. |
| Flow | Sprint Rollover Rate for legacy tickets | Measures predictability for legacy work. | Rollover decreases as stories are sliced smaller and riskier changes are isolated. |
| Modernization | Migration coverage | Percentage of traffic or functionality handled by modern components. | Core flows move from legacy to new services; remaining legacy surface area shrinks. |
| Modernization | Technical debt paydown velocity | Amount of debt‑removal work completed per quarter. | Debt work consistently ships and correlates with fewer incidents and shorter lead times. |
| Capacity | Dev Work Days, legacy vs. non‑legacy | Shows how much capacity the legacy system consumes. | Share of time spent on firefighting shrinks while modernization and feature work grows. |

Together, these metrics show whether legacy systems are getting safer, easier to change, and smaller in scope without penalizing teams for the extra care fragile code requires.

How to instrument stability and incident metrics

Legacy systems often generate a disproportionate share of outages and user‑visible defects. That’s where measurement should start.

Focus your stability view on:

  • Incident Volume broken down by service, severity, and cause category
  • Mean time to detect (MTTD) and mean time to restore (MTTR)
  • Time spent in degraded or partially available states
  • Recurring incident themes involving the same legacy modules

If incidents and MTTR are improving while overall change volume stays steady, your hardening work is paying off, even if feature throughput looks modest.
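
As a minimal sketch of the recovery metrics, assuming incident records carry started_at, detected_at, and restored_at timestamps (the field names here are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Incident:
    service: str           # e.g. "legacy-billing"
    severity: str          # e.g. "sev1"
    started_at: datetime   # when the failure began
    detected_at: datetime  # when monitoring or a user flagged it
    restored_at: datetime  # when service was restored

def mttd_minutes(incidents: list[Incident]) -> float:
    """Median time to detect, in minutes."""
    return median(
        (i.detected_at - i.started_at).total_seconds() / 60 for i in incidents
    )

def mttr_minutes(incidents: list[Incident]) -> float:
    """Median time to restore, in minutes; the median resists one-off mega-incidents."""
    return median(
        (i.restored_at - i.started_at).total_seconds() / 60 for i in incidents
    )
```

Slicing the same records by service, severity, and cause category gives the Incident Volume breakdown above.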

Measuring quality in code that predates modern practices

In legacy systems, defect patterns matter more than raw counts.

Use Open Bugs by age band (0–30, 31–90, 90+ days) and tag them by module. A backlog full of old, high‑severity bugs in critical legacy areas is a clearer risk indicator than a large number of fresh minor issues.
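
A minimal sketch of the banding, assuming open bugs can be exported as (module, opened-on) pairs; the records below are illustrative:

```python
from collections import Counter
from datetime import date

# Illustrative open bugs: (module, date the bug was opened).
open_bugs = [
    ("legacy-billing", date(2024, 1, 5)),
    ("modern-api", date(2024, 6, 1)),
]

def age_band(opened_on: date, today: date) -> str:
    """Bucket an open bug into the 0-30 / 31-90 / 90+ day bands."""
    age_days = (today - opened_on).days
    if age_days <= 30:
        return "0-30"
    if age_days <= 90:
        return "31-90"
    return "90+"

today = date.today()
# Count of open bugs per (module, age band) pair.
backlog = Counter((module, age_band(opened, today)) for module, opened in open_bugs)
```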

Complement that with Change Failure Rate calculated only for deployments that touch legacy components. DORA uses CFR as a core stability metric and shows that elite and high performers keep it low while still shipping frequently. For legacy work, trend beats absolute value:

  • Short term: CFR may be higher than for modern services
  • Medium term: CFR should fall as tests, rollbacks, and canary releases improve

If CFR is flat or rising for legacy components, that’s a signal to invest in testability and safer rollout patterns before adding more change.
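
One way to scope the calculation, assuming deployment records list the paths they touched and whether each deploy caused a failure; the legacy path prefixes here are assumptions:

```python
# Illustrative deployment records: paths touched and whether the deploy
# caused a failure (rollback, hotfix, or incident).
deployments = [
    {"paths": ["billing/ledger.cbl"], "caused_failure": True},
    {"paths": ["api/v2/handlers.py"], "caused_failure": False},
]

LEGACY_PREFIXES = ("billing/", "mainframe_bridge/")  # assumed legacy modules

def touches_legacy(deploy: dict) -> bool:
    """True if any file in the deploy falls under a legacy module."""
    return any(p.startswith(LEGACY_PREFIXES) for p in deploy["paths"])

def change_failure_rate(deploys: list[dict]) -> float:
    """Failed deployments divided by total deployments."""
    return sum(d["caused_failure"] for d in deploys) / len(deploys) if deploys else 0.0

legacy_cfr = change_failure_rate([d for d in deployments if touches_legacy(d)])
modern_cfr = change_failure_rate([d for d in deployments if not touches_legacy(d)])
```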

Flow metrics that treat legacy work fairly

Legacy changes often take longer because of extra analysis, regression testing, and coordination, not because the team is slow. Flow metrics should make that explicit instead of averaging everything together.

For Lead Time for Changes, separate:

  • Changes that touch only legacy modules
  • Changes that touch only modern services
  • Cross‑cutting changes that span both

This split does three things:

  1. Shows whether modernization is actually making new work easier.
  2. Sets realistic expectations on legacy‑heavy tasks.
  3. Highlights when lead time spikes indicate process problems, not inherent legacy risk.
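
As a minimal sketch of that split, assuming each change can be mapped to the file paths it touches (the path prefixes and lead times below are illustrative):

```python
from collections import defaultdict
from statistics import median

LEGACY_PREFIXES = ("billing/", "mainframe_bridge/")  # assumed legacy paths

def classify_change(paths: list[str]) -> str:
    """Bucket a change as legacy-only, modern-only, or cross-cutting."""
    legacy = any(p.startswith(LEGACY_PREFIXES) for p in paths)
    modern = any(not p.startswith(LEGACY_PREFIXES) for p in paths)
    if legacy and modern:
        return "cross-cutting"
    return "legacy-only" if legacy else "modern-only"

# Illustrative change records: files touched and lead time in hours.
changes = [
    {"paths": ["billing/ledger.cbl"], "lead_time_h": 72.0},
    {"paths": ["api/v2/handlers.py"], "lead_time_h": 10.0},
    {"paths": ["billing/ledger.cbl", "api/v2/handlers.py"], "lead_time_h": 96.0},
]

lead_times = defaultdict(list)
for change in changes:
    lead_times[classify_change(change["paths"])].append(change["lead_time_h"])

for bucket, hours in sorted(lead_times.items()):
    print(f"{bucket}: median lead time {median(hours):.0f}h")
```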

Similarly, track Sprint Rollover Rate specifically for legacy tickets. High rollover there usually means:

  • Risk is being discovered late
  • Stories are scoped too broadly
  • There isn’t enough time carved out for investigation and hardening

Labels like legacy or brownfield in your planning tools make these patterns visible and help you adjust capacity and slicing.
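
A minimal sketch of the rollover calculation, assuming each committed ticket records its labels and whether it finished within its sprint (both fields are hypothetical):

```python
# Illustrative sprint snapshot: committed tickets with labels and an
# end-of-sprint completion flag.
tickets = [
    {"labels": {"legacy"}, "completed_in_sprint": False},
    {"labels": {"feature"}, "completed_in_sprint": True},
]

def rollover_rate(tickets: list[dict], label: str | None = None) -> float:
    """Share of committed tickets that rolled into the next sprint,
    optionally filtered to one label."""
    pool = [t for t in tickets if label is None or label in t["labels"]]
    if not pool:
        return 0.0
    return sum(not t["completed_in_sprint"] for t in pool) / len(pool)

legacy_rollover = rollover_rate(tickets, label="legacy")
overall_rollover = rollover_rate(tickets)
```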

Tracking technical debt paydown without hiding feature work

Technical debt metrics are easy to game. Static analysis scores or hours spent cleaning code rarely convince anyone outside engineering.

Instead, make debt work visible in how it changes other metrics.

Use Dev Work Days plus simple task labels to separate:

  • Firefighting and incident response
  • Keeping‑the‑lights‑on changes (config edits, tiny fixes)
  • Debt removal, refactors, and migrations
  • Net‑new feature work

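A minimal sketch of that breakdown, assuming each Dev Work Day is logged with one of the labels above (the entries are illustrative):

```python
from collections import Counter

# Illustrative log: one entry per Dev Work Day, tagged with a task label.
work_days = [
    {"dev": "a", "label": "firefighting"},
    {"dev": "a", "label": "debt-removal"},
    {"dev": "b", "label": "feature"},
    {"dev": "b", "label": "keeping-the-lights-on"},
]

mix = Counter(day["label"] for day in work_days)
total = sum(mix.values())
for label, days in mix.most_common():
    print(f"{label}: {days} day(s), {days / total:.0%} of capacity")
```
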
Then correlate shifts in that mix with trends in:

  • Incident Volume and MTTR
  • Change Failure Rate on legacy components
  • Lead Time for Changes on legacy paths

If debt paydown is working, the firefighting share shrinks while those trends improve.

How tools like minware help legacy‑heavy teams

Manually tracking where time goes is error‑prone.

Because minware reconstructs Dev Work Days from commit activity, pull requests, and tickets across repos, it can show:

  • How much engineering time flows into legacy modules vs modern services
  • How Lead Time for Changes and Sprint Rollover Rate differ between legacy and new work
  • Whether incident and bug trends are improving in the parts of the system where you’re paying down technical debt

By combining that time model with metrics like Incident Volume and Open Bugs, leaders can see whether the organization is truly paying down risk or just moving effort around.

Implementation checklist

To put these ideas into practice this quarter:

  1. Tag legacy work explicitly

    • Mark legacy components in code and label legacy‑related tickets in your tracker.
  2. Split headline metrics by legacy vs modern

    • Report Lead Time for Changes, Change Failure Rate, and Sprint Rollover Rate separately for legacy and modern work.
  3. Stand up a stability dashboard for legacy

    • Track Incident Volume, MTTD, MTTR, and recurring incident themes for legacy services.
  4. Define 1–2 modernization KPIs

    • For example, migration coverage and decommissioned assets; a sketch of migration coverage follows this list. Review them monthly.
  5. Use these metrics during planning; don’t wait for postmortems

    • Let stability and modernization data guide investment levels, not just root‑cause analysis after things break.
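
A sketch of the migration coverage KPI from step 4, assuming request counts can be aggregated per component and modern components identified by a naming convention (all names and numbers here are hypothetical):

```python
# Illustrative request totals per component over one reporting window.
requests_by_component = {
    "modern-checkout": 840_000,
    "legacy-checkout": 160_000,
    "modern-search": 1_200_000,
}

def migration_coverage(counts: dict[str, int]) -> float:
    """Share of traffic handled by modern components, 0.0 to 1.0."""
    total = sum(counts.values())
    modern = sum(v for name, v in counts.items() if name.startswith("modern-"))
    return modern / total if total else 0.0

print(f"Migration coverage: {migration_coverage(requests_by_component):.0%}")
```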

The goal is to make the value of brownfield engineering as visible as greenfield feature delivery. Tie these trends directly to modernization initiatives and roadmap decisions using the metric set outlined above. That makes brownfield investment comparable to any other strategic bet, rather than a vague “keep the lights on” cost.

FAQ

What counts as legacy code for metric purposes?

Treat code as legacy when changes require extra risk mitigation compared to your modern stack, because of missing tests/documentation, outdated technology, tight coupling, or all three. A practical definition is “code that cannot be safely changed without significant additional effort,” which aligns with Feathers’s description of legacy code as code without tests.

Should legacy teams be held to the same DORA targets?

Use DORA benchmarks as direction, not hard thresholds. The key is that change failure rate and lead time for legacy components improve over time, even if they remain higher than for modern services.

How can we show modernization ROI to executives?

Frame metrics in terms of risk and capacity:

  • Fewer incidents and shorter outages
  • Lower Change Failure Rate on legacy components
  • A shrinking share of Dev Work Days spent on firefighting and legacy maintenance