
My “Four-(Sometimes Five-) Bullet” Incident Snapshot
I’ve written more incident docs than I’ll ever remember, and the ones that cause the fewest questions all open with the same, brutally short rubric:
- Detection speed – slow / medium / fast
- Mitigation speed – slow / medium / fast
- Remediation speed – N/A / slow / medium / fast
- Impact scope – low / medium / high
- Communication speed – N/A / slow / medium / fast
That’s the whole elevator pitch: How long did we fly blind? How long were users hurt? How long until systems were clean? How big was the blast radius? How quickly did we tell people?
Terms
Why “mitigation,” not “resolution”?
“Resolution” usually means completely finished (see ITIL, StatusPage). Bullet #2 only says “the bleeding stopped,” not “root cause removed.” Industry vernacular calls that mitigation (think MTTM, mean time to mitigate, in the Google SRE books), so I do too.
Rough Axes
Defining the axes (recap)
- Detection
  - Fast < 5 min · Med 5-30 min · Slow > 30 min
- Mitigation
  - Fast < 15 min · Med 15-60 min · Slow > 60 min
- Remediation
  - Fast < 24 h · Med 24 h-7 d · Slow > 7 d · N/A
- Impact
  - Low < 1 % traffic · Med 1-10 % · High > 10 % (pick one driver)
- Communication
  - Fast < 10 min · Med 10-30 min · Slow > 30 min · N/A
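If you want the labels assigned consistently rather than by feel, the thresholds above are simple enough to encode. The following is a minimal sketch, not part of the original rubric: the function names (`classify_speed`, `classify_remediation`, `classify_impact`) are made up for illustration, and treating a value that lands exactly on a boundary as "medium" is my reading of ranges like "5-30".

```python
from typing import Optional

# Thresholds copied from the list above as (fast_below, slow_above).
# Assumption: values at or between the two bounds count as "medium".
MINUTE_AXES = {
    "detection": (5, 30),
    "mitigation": (15, 60),
    "communication": (10, 30),
}
REMEDIATION_HOURS = (24, 7 * 24)  # 24 h and 7 d
IMPACT_PERCENT = (1, 10)          # percent of traffic (or whichever driver you pick)


def _bucket(value: float, low: float, high: float, labels: tuple) -> str:
    """Return labels[0] below low, labels[2] above high, labels[1] otherwise."""
    if value < low:
        return labels[0]
    if value > high:
        return labels[2]
    return labels[1]


def classify_speed(axis: str, minutes: Optional[float]) -> str:
    """Label detection, mitigation, or communication from elapsed minutes."""
    if minutes is None:
        # The rubric only offers N/A for communication (and remediation).
        if axis != "communication":
            raise ValueError(f"{axis} cannot be N/A in this rubric")
        return "N/A"
    low, high = MINUTE_AXES[axis]
    return _bucket(minutes, low, high, ("fast", "medium", "slow"))


def classify_remediation(hours: Optional[float]) -> str:
    """Label remediation from elapsed hours; None means not applicable."""
    if hours is None:
        return "N/A"
    return _bucket(hours, *REMEDIATION_HOURS, ("fast", "medium", "slow"))


def classify_impact(percent: float) -> str:
    """Label impact scope from a single percentage driver."""
    return _bucket(percent, *IMPACT_PERCENT, ("low", "medium", "high"))
```

For instance, `classify_speed("detection", 7)` lands in the medium band and `classify_impact(22)` in the high band.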
Context
Where this snapshot fits in the wider world:
Snapshot axis | Google SRE / DORA | PagerDuty / Atlassian IR | NIST PICERL | What you add |
---|---|---|---|---|
Detection | MTTD | “Detection” | Identification | — |
Mitigation | MTTM / Containment | Mitigation | Containment | Clear boundary before root-fix |
Remediation | MTTR (restore) | Resolution | Eradication + Recovery | Explicit timer to “clean state” |
Impact | Severity label (implied) | Sev label | Severity | Explicit numeric / % metric |
Comms | Time-to-Ack / First Update | TTFU (First Update) | Notification Time | Optional but visible |
Think of the snapshot as the common denominator of those frameworks without the ceremony. If Finance or Legal later need cost of impact or regulatory notification timestamps, I link to that detail in the retro instead of bloating the headline.
Example
Example snapshot
Detection: Medium — 7 min (Grafana checkout-error alarm)
Mitigation: Fast — flag rollback at 12 min
Remediation: Medium — schema reverted in 36 h
Impact: High — 22 % of checkouts, ≈ $2.6 M stalled GMV
Comms: Fast — StatusPage + Support macro at 9 min
One screen, five numbers, whole story.
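The numbers in a snapshot like this are offsets from the moment the incident started, so they fall straight out of the timeline. Here is a rough sketch of that arithmetic, assuming every axis is measured from incident start. The timestamps are hypothetical, invented only so the offsets match the example above, and `minutes_since` is an illustrative helper, not anything from a real tool.

```python
from datetime import datetime


def minutes_since(start: datetime, event: datetime) -> float:
    """Elapsed minutes between incident start and a timeline event."""
    return (event - start).total_seconds() / 60


# Hypothetical timeline, chosen to reproduce the example's offsets.
started = datetime(2024, 1, 1, 10, 0)
detected = datetime(2024, 1, 1, 10, 7)       # alarm fires
first_update = datetime(2024, 1, 1, 10, 9)   # first stakeholder update
mitigated = datetime(2024, 1, 1, 10, 12)     # flag rollback
remediated = datetime(2024, 1, 2, 22, 0)     # schema reverted

detection_min = minutes_since(started, detected)
comms_min = minutes_since(started, first_update)
mitigation_min = minutes_since(started, mitigated)
remediation_h = minutes_since(started, remediated) / 60

print(detection_min, mitigation_min, remediation_h, comms_min)
```

If your team measures mitigation from detection rather than from start, change the first argument accordingly and say so in the doc.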
Template
Steal-this-template
### Incident Snapshot
- Detection speed: <fast|medium|slow> — X min (source)
- Mitigation speed: <fast|medium|slow> — Y min (action)
- Remediation speed: <fast|medium|slow|N/A> — Z h/d (action)
- Impact scope: <low|medium|high> — metric
- Communication speed: <fast|medium|slow|N/A> — N min (first stakeholder update) <!-- optional -->
Paste it, fill it in under two minutes, and get back to fixing things that matter. Incidents are inevitable, but muddy recaps aren’t.
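If your incident tooling already has these fields, you can generate the block instead of typing it. This is a minimal sketch under that assumption: `Axis` and `render_snapshot` are hypothetical names, and the values passed in are the ones from the example snapshot earlier in this post.

```python
from typing import NamedTuple, Optional

SEP = " \u2014 "  # the same dash the template uses between rating and detail


class Axis(NamedTuple):
    label: str   # e.g. "Detection speed"
    rating: str  # e.g. "fast", "medium", "slow", "N/A"
    detail: str  # the number plus its source or action


def render_snapshot(
    detection: Axis,
    mitigation: Axis,
    remediation: Axis,
    impact: Axis,
    communication: Optional[Axis] = None,  # the optional fifth bullet
) -> str:
    """Render the snapshot as a markdown heading plus one bullet per axis."""
    rows = [detection, mitigation, remediation, impact]
    if communication is not None:
        rows.append(communication)
    lines = ["### Incident Snapshot"]
    lines += [f"- {a.label}: {a.rating}{SEP}{a.detail}" for a in rows]
    return "\n".join(lines)


snapshot = render_snapshot(
    Axis("Detection speed", "medium", "7 min (Grafana checkout-error alarm)"),
    Axis("Mitigation speed", "fast", "12 min (flag rollback)"),
    Axis("Remediation speed", "medium", "36 h (schema reverted)"),
    Axis("Impact scope", "high", "22 % of checkouts"),
    Axis("Communication speed", "fast", "9 min (StatusPage + Support macro)"),
)
print(snapshot)
```

Leaving out the `communication` argument drops the fifth bullet, which matches the "sometimes five" framing.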
Josh Beckman