
My “Four-(Sometimes Five-) Bullet” Incident Snapshot
I’ve written more incident docs than I’ll ever remember, and the ones that cause the fewest quesitons all open with the same, brutally short rubric:
- Detection speed – slow / medium / fast
- Mitigation speed – slow / medium / fast
- Remediation speed – N/A / slow / medium / fast
- Impact scope – low / medium / high
- Communication speed – N/A / slow / medium / fast
That’s the whole elevator pitch: How long did we fly blind? How long were users hurt? How long until systems were clean? How big was the blast radius? How quickly did we tell people?
I kinda think of this like the retrospective or past-tense of a four-point situation report.
Terms
Why “mitigation,” not “resolution”?
“Resolution” usually means completely finished (see ITIL, StatusPage). Bullet #2 is only “the bleeding stopped,” not “root cause removed.” Industry vernacular calls that mitigation (think MTTM (mean time to mitigate) in Google SRE books), so I do too.
Rough Aces
Defining the axes (recap)
- Detection
- Fast < 5 m · Med 5-30 m · Slow > 30 m
- Mitigation
- Fast < 15 m · Med 15-60 m · Slow > 60 m
- Remediation
- Fast < 24 h · Med 24 h-7 d · Slow > 7 d · N/A
- Impact
- Low < 1 % traffic · Med 1-10 % · High > 10 % (pick one driver)
- Communication
- Fast < 10 m · Med 10-30 m · Slow > 30 m · N/A
Context
Where this snapshot fits in the wider world:
Snapshot axis | Google SRE / DORA | PagerDuty / Atlassian IR | NIST PICERL | What you add |
---|---|---|---|---|
Detection | MTTD | “Detection” | Identification | — |
Mitigation | MTTM / Containment | Mitigation | Containment | Clear boundary before root-fix |
Remediation | MTTR (restore) | Resolution | Eradication + Recovery | Explicit timer to “clean state” |
Impact | Severity label (implied) | Sev label | Severity | Explicit numeric / % metric |
Comms | Time-to-Ack / First Update | TTFU (First Update) | Notification Time | Optional but visible |
Think of the snapshot as the common denominator of those frameworks without the ceremony. If Finance or Legal later need cost of impact or regulatory notification timestamps, I link to that detail in the retro instead of bloating the headline.
Example
Example snapshot
Detection: Medium — 7 min (Grafana checkout-error alarm)
Mitigation: Fast — flag rollback at 12 min
Remediation: Medium — schema reverted in 36 h
Impact: High — 22 % of checkouts, ≈ $2.6 M stalled GMV
Comms: Fast — StatusPage + Support macro at 9 min
One screen, five numbers, whole story.
Template
Steal-this-template
### Incident Snapshot
- Detection speed: <fast|medium|slow> — X min (source)
- Mitigation speed: <fast|medium|slow> — Y min (action)
- Remediation speed: <fast|medium|slow|N/A> — Z h/d (action)
- Impact scope: <low|medium|high> — metric
- Communication speed: <fast|medium|slow|N/A> — N min (first stakeholder update) <!-- optional -->
Paste it, fill it in under two minutes, and get back to fixing things that matter. Incidents are inevitable but muddy recaps aren’t.
Josh BeckmanReference
- Blog / Practicing
- metrics, software-engineering, incident-management
-
Permalink to
2025.BLG.115
- Edit
← Previous | Next → |
My Graham Evaluation | Claude Code Notifications for Async Programming |
Widgets
Updated: |
v2.16.0-r495-gf13bb6b5
|