The Blameless Postmortem That Changed Everything

The outage itself was ordinary. A config change took down a service for forty minutes during business hours. What happened afterward is why this is a paradise story and not a horror one.

What we did right

The engineer who made the change came to the postmortem braced to be blamed. That is the instinct every blame culture trains into people, and it is exactly the instinct that makes the next outage worse, because people who fear blame hide problems until they explode.

We ran it blameless, on purpose, the way we always do. The question on the table was never “who did this?” It was: how did our system make this mistake easy to make and hard to catch? A skilled, careful engineer following the normal process caused this, so the process and the system are where the answers live.

The conversation

Once it was safe to be honest, the real story came out quickly. The config had no validation, so a typo deployed cleanly. There was no canary for config changes, only for code. And the rollback procedure was undocumented, which is why forty minutes were spent rediscovering it live.

None of that is a personal failing. All of it is fixable.

The result

Three concrete action items came out of that hour, each with an owner and a date: schema validation for configs, canary rollouts extended to config, and a written, tested rollback runbook. All three shipped within two weeks.
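To make the first action item concrete, here is a minimal sketch of what schema validation for a config could look like. It is an illustration, not the validator we actually shipped: the key names, the SCHEMA table, and the validate_config function are all hypothetical, and a real setup might well use an existing schema library instead. The point is only that a misspelled key gets rejected before deploy instead of going out cleanly.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema for illustration: key -> (expected type, required?)
SCHEMA: Dict[str, Tuple[type, bool]] = {
    "service_name": (str, True),
    "timeout_seconds": (int, True),
    "retry_limit": (int, False),
}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []

    # Unknown keys are errors: this is what catches a typo in a key name.
    for key in config:
        if key not in SCHEMA:
            errors.append(f"unknown key: {key!r}")

    for key, (expected_type, required) in SCHEMA.items():
        if key not in config:
            if required:
                errors.append(f"missing required key: {key!r}")
            continue
        value = config[key]
        # Exact type match, so True/False is not silently accepted as an int.
        if type(value) is not expected_type:
            errors.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )

    return errors


if __name__ == "__main__":
    # A typo in a key name is reported twice: once as unknown, once as missing.
    for problem in validate_config({"service_name": "api", "timeout_secnods": 30}):
        print(problem)
```

Run as a pre-deploy check, a non-empty error list would block the rollout. Treating unknown keys as errors is a deliberate choice in this sketch: a lenient validator that ignores keys it does not recognize would let exactly this kind of typo through.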

The deeper result was cultural. Word got around that the postmortem had been safe, even useful. People started raising risks before they became incidents, saying “this thing scares me” in design reviews. The outage made the whole system stronger, which only happened because we treated the person as the ally, not the cause.
