Mean Time to Restore (MTTR)

Mean Time to Restore (MTTR) measures the average time it takes to recover from an unplanned outage or service disruption. It reflects how quickly teams detect, respond to, and resolve production issues that impact availability.

Calculation

Restoration is typically defined as the point at which full service functionality returns following a user-impacting failure. The start time may be based on alerting, incident creation, or customer-reported impact. The end time should reflect actual resolution, not just mitigation.

This metric is calculated by dividing the total duration of all incidents by the number of incidents during the same time period:

MTTR = total incident resolution time ÷ number of incidents
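
For illustration, here is a minimal Python sketch of this calculation. The incident records and field names (`started_at`, `resolved_at`) are assumptions made for the example, not any particular tool's schema:

```python
from datetime import datetime, timedelta

# Hypothetical incident records; timestamps mark user-impacting start
# and full restoration, per the definition above.
incidents = [
    {"started_at": datetime(2024, 5, 1, 9, 0), "resolved_at": datetime(2024, 5, 1, 10, 30)},
    {"started_at": datetime(2024, 5, 7, 14, 0), "resolved_at": datetime(2024, 5, 7, 14, 45)},
    {"started_at": datetime(2024, 5, 20, 2, 15), "resolved_at": datetime(2024, 5, 20, 6, 15)},
]

# MTTR = total incident resolution time / number of incidents
total = sum((i["resolved_at"] - i["started_at"] for i in incidents), timedelta())
mttr = total / len(incidents)
print(f"MTTR: {mttr}")  # MTTR: 2:05:00
```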

Goals

MTTR helps teams evaluate the efficiency of their incident response process. It answers questions like:

  • How quickly are we resolving production outages?
  • Are our monitoring, escalation, and on-call systems functioning well?
  • Are we learning from failures and shortening recovery time over time?

Reducing MTTR improves user experience, operational confidence, and system resilience. For more background, see Google's SRE guidance on incident response.

Variations

MTTR is sometimes referred to as Mean Time to Recovery or Mean Time to Repair, particularly in IT service management and SRE contexts. These terms are often used interchangeably.

Common segmentations include:

  • By severity, comparing time to restore for critical (P1) vs. lower-priority incidents (a sketch follows this list)
  • By service, to detect which systems recover slowly
  • By incident type, such as regressions, infrastructure failures, or third-party issues
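
For example, segmenting by severity might look like the following sketch; the labels and durations are invented for illustration:

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical (severity, time-to-restore) pairs from an incident tracker.
incidents = [
    ("P1", timedelta(minutes=40)),
    ("P1", timedelta(hours=2)),
    ("P2", timedelta(hours=6)),
    ("P3", timedelta(hours=30)),
]

# Group restore times by severity, then average each segment.
by_severity = defaultdict(list)
for severity, duration in incidents:
    by_severity[severity].append(duration)

for severity, durations in sorted(by_severity.items()):
    mttr = sum(durations, timedelta()) / len(durations)
    print(f"{severity}: MTTR {mttr} over {len(durations)} incident(s)")
```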

Some teams report median or 95th percentile MTTR to avoid distortion from extreme cases. Others break down MTTR by response phase (detection, triage, resolution) for deeper operational insight.
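
A sketch of those alternative summaries, using only the Python standard library; the durations are invented, and the 95th percentile uses a simple nearest-rank approximation:

```python
import statistics
from datetime import timedelta

# Invented restore times; one extreme outage (610 minutes) skews the mean.
durations = sorted(timedelta(minutes=m) for m in (12, 25, 31, 48, 55, 70, 95, 610))

mean = sum(durations, timedelta()) / len(durations)
median = statistics.median(durations)
# Nearest-rank 95th percentile: a simple approximation for small samples.
p95 = durations[min(len(durations) - 1, int(0.95 * len(durations)))]

print(f"mean {mean}, median {median}, p95 {p95}")
# mean 1:58:15, median 0:51:30, p95 10:10:00 -- the median resists the outlier.
```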

Limitations

MTTR measures how long incidents last but not how often they happen. A team with excellent MTTR but frequent disruptions may still be delivering an unreliable experience.

The metric also does not indicate whether recovery was stable or temporary. Teams may “restore” service quickly without resolving the root cause, which leads to recurring incidents or hidden degradation.

To get a more complete view of operational resilience, use MTTR alongside:

  • Mean Time Between Failures (MTBF): reveals how often service disruptions occur between periods of normal operation.
  • Change Failure Rate: identifies whether releases are introducing the instability that drives incidents.
  • Incident Volume: provides insight into the overall rate of disruption and alert noise.
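
As a sketch of how MTBF complements MTTR, consider a fixed observation window; the figures below are illustrative:

```python
from datetime import timedelta

# Illustrative downtime durations over a 30-day observation window.
window = timedelta(days=30)
downtimes = [timedelta(hours=1.5), timedelta(minutes=45), timedelta(hours=4)]

total_downtime = sum(downtimes, timedelta())
mttr = total_downtime / len(downtimes)             # average time to restore
mtbf = (window - total_downtime) / len(downtimes)  # average operating time between failures

print(f"MTTR {mttr} vs. MTBF {mtbf}")
# A low MTTR paired with a low MTBF still means users see frequent disruption.
```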

Optimization

Improving MTTR means reducing the time it takes to detect, diagnose, and resolve production failures.

  • Improve observability and alerting. Early detection is critical to fast recovery. Invest in monitoring coverage, actionable alerts, and structured logging to surface symptoms as early as possible.

  • Clarify on-call and escalation paths. Teams respond faster when ownership is clear. Maintain updated on-call rotations, paging policies, and Incident Response playbooks to avoid delays during triage.

  • Establish rapid recovery paths. Use tools like rollback automation, config toggles, or Feature Flags to mitigate user impact quickly while root cause analysis continues (see the sketch after this list).

  • Run recovery rehearsals. Use Game Days and chaos experiments to practice recovery scenarios and identify failure points in your incident handling process.

  • Conduct structured postmortems. MTTR improves over time when teams treat each failure as a learning opportunity. Analyze what slowed detection, response, or communication, and use those insights to fix systemic gaps.
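
As one concrete illustration of the rapid recovery paths above, here is a minimal feature-flag kill switch; the flag store, flag name, and checkout functions are hypothetical stand-ins, not any particular vendor's SDK:

```python
# Hypothetical in-memory flag store; in production this would be a
# dynamic configuration service or a feature-flag SDK.
FLAGS = {"new_checkout_flow": True}

def new_checkout(cart):
    return f"new checkout for {len(cart)} items"     # code path under suspicion

def legacy_checkout(cart):
    return f"legacy checkout for {len(cart)} items"  # known-good fallback

def checkout(cart):
    # Routing on the flag makes recovery a config change, not a redeploy.
    if FLAGS.get("new_checkout_flow", False):
        return new_checkout(cart)
    return legacy_checkout(cart)

# During an incident, responders flip the flag to restore the stable path
# while root cause analysis continues.
FLAGS["new_checkout_flow"] = False
print(checkout(["book", "pen"]))  # -> legacy checkout for 2 items
```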

Lowering MTTR is not just about moving fast under pressure. It is about building systems and processes that support reliable, confident recovery without sacrificing safety or clarity.