My “Four-(Sometimes Five-) Bullet” Incident Snapshot

I’ve written more incident docs than I’ll ever remember, and the ones that cause the fewest quesitons all open with the same, brutally short rubric:

Detection speed – slow / medium / fast
Mitigation speed – slow / medium / fast
Remediation speed – N/A / slow / medium / fast
Impact scope – low / medium / high
Communication speed – N/A / slow / medium / fast

That’s the whole elevator pitch: How long did we fly blind? How long were users hurt? How long until systems were clean? How big was the blast radius? How quickly did we tell people?

I kinda think of this like the retrospective or past-tense of a four-point situation report.

Terms

Why “mitigation,” not “resolution”?

“Resolution” usually means completely finished (see ITIL, StatusPage). Bullet #2 is only “the bleeding stopped,” not “root cause removed.” Industry vernacular calls that mitigation (think MTTM (mean time to mitigate) in Google SRE books), so I do too.

Rough Aces

Defining the axes (recap)

Detection
- Fast < 5 m · Med 5-30 m · Slow > 30 m
Mitigation
- Fast < 15 m · Med 15-60 m · Slow > 60 m
Remediation
- Fast < 24 h · Med 24 h-7 d · Slow > 7 d · N/A
Impact
- Low < 1 % traffic · Med 1-10 % · High > 10 % (pick one driver)
Communication
- Fast < 10 m · Med 10-30 m · Slow > 30 m · N/A

Context

Where this snapshot fits in the wider world:

Snapshot axis	Google SRE / DORA	PagerDuty / Atlassian IR	NIST PICERL	What you add
Detection	MTTD	“Detection”	Identification	—
Mitigation	MTTM / Containment	Mitigation	Containment	Clear boundary before root-fix
Remediation	MTTR (restore)	Resolution	Eradication + Recovery	Explicit timer to “clean state”
Impact	Severity label (implied)	Sev label	Severity	Explicit numeric / % metric
Comms	Time-to-Ack / First Update	TTFU (First Update)	Notification Time	Optional but visible

Think of the snapshot as the common denominator of those frameworks without the ceremony. If Finance or Legal later need cost of impact or regulatory notification timestamps, I link to that detail in the retro instead of bloating the headline.

Example

Example snapshot

Detection:     Medium — 7 min (Grafana checkout-error alarm)
Mitigation:    Fast   — flag rollback at 12 min
Remediation:   Medium — schema reverted in 36 h
Impact:        High   — 22 % of checkouts, ≈ $2.6 M stalled GMV
Comms:         Fast   — StatusPage + Support macro at 9 min

One screen, five numbers, whole story.

Template

Steal-this-template

### Incident Snapshot
- Detection speed:     <fast|medium|slow> — X min (source)
- Mitigation speed:    <fast|medium|slow> — Y min (action)
- Remediation speed:   <fast|medium|slow|N/A> — Z h/d (action)
- Impact scope:        <low|medium|high> — metric
- Communication speed: <fast|medium|slow|N/A> — N min (first stakeholder update)  <!-- optional -->

Paste it, fill it in under two minutes, and get back to fixing things that matter. Incidents are inevitable but muddy recaps aren’t.

Reference

Permalink (2025.BLG.116)
On 2025, July 25, Friday
In Blog / Practicing
Tagged metrics, software-engineering, incident-management
Edit

← Previous	Next →
The Fantastic Four: First Steps	Afternoon Run 🏃

Widgets

Network Graph

Legend

Key	Action
`o`	Source
`e`	Edit
`i`	Insight
`r`	Random
`h`	Home
`s` or `/`	Search