The Runbook That Worked
“Confidence at 3am comes from a good runbook, not a brave heart.”
First week on call. The page came at 3am, the way first pages always seem to. CRITICAL: order-queue depth above threshold and climbing.
A year ago, this is where the cold sweat would have started. Instead, I tapped the link in the alert.
What we did right
Every alert we send carries a runbook link, and the runbook is not a vague wiki page written once and never opened. It is a short, specific checklist, kept honest because we actually use it during incidents and fix it afterward when it is wrong.
This one opened with the three most likely causes in order of probability. It named the exact dashboard to open, with a direct link. It listed the safe first action and the command to run it, plus the line that mattered most to a nervous new hire: here is how to tell if it is working, and here is who to wake if it is not.
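For readers who think in code, the shape of that runbook can be sketched as a small data structure. This is an illustration only, not our actual format: the class name, the field names, the example URL, and the contact are all hypothetical, and the command is left as a placeholder rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class Runbook:
    """Hypothetical shape of one runbook entry; every field name is illustrative."""

    alert_name: str
    likely_causes: list[str]      # ordered from most to least probable
    dashboard_url: str            # direct link, not "look in the dashboards"
    first_action: str             # the safe first thing to try
    first_action_command: str     # the exact command for that action
    success_signal: str           # how to tell if it is working
    escalation_contact: str       # who to wake if it is not

    def missing_fields(self) -> list[str]:
        """Return the names of any fields that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# Example entry with assumed values; the escalation contact is left blank on purpose.
order_queue = Runbook(
    alert_name="order-queue depth above threshold",
    likely_causes=["stuck consumer", "second most likely cause", "third most likely cause"],
    dashboard_url="https://dashboards.example.com/order-queue",  # placeholder URL
    first_action="restart the consumer group",
    first_action_command="<restart command from the runbook>",   # placeholder
    success_signal="queue depth starts falling within two minutes",
    escalation_contact="",
)

print(order_queue.missing_fields())
```

A check like `missing_fields` could run whenever a runbook is edited, so that an entry with no success signal or no escalation contact is caught in daylight instead of at 3am.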
The moment
I followed it like a recipe. The dashboard showed a stuck consumer, cause number one on the list. The runbook said to restart the consumer group with a specific command and watch the queue depth start falling within two minutes.
I ran it. The graph turned the corner. The queue drained. My heart rate returned to something survivable.
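The "how to tell if it is working" step, queue depth falling within two minutes of the restart, is simple enough to sketch as a check. This is a minimal illustration under assumptions: the function name, the sample format, and the numbers are invented for the example, and a real check would read samples from whatever metrics system is in use.

```python
from typing import Sequence

Sample = tuple[float, int]  # (timestamp in seconds, queue depth)


def queue_is_draining(
    samples: Sequence[Sample],
    restart_time: float,
    window_seconds: float = 120.0,
) -> bool:
    """Return True if the latest depth inside the window is below the depth at restart.

    Assumes `samples` is sorted by timestamp. Hypothetical helper, not a real API.
    """
    before = [depth for t, depth in samples if t <= restart_time]
    after = [
        depth
        for t, depth in samples
        if restart_time < t <= restart_time + window_seconds
    ]
    if not before or not after:
        raise ValueError("need at least one sample before and one after the restart")
    # Compare the most recent in-window reading with the last reading at restart.
    return after[-1] < before[-1]


# Invented readings: depth climbs until the restart at t=60, then turns the corner.
recovering = [(0.0, 900), (60.0, 1000), (90.0, 1020), (150.0, 700)]
still_stuck = [(0.0, 900), (60.0, 1000), (120.0, 1100)]

print(queue_is_draining(recovering, restart_time=60.0))
print(queue_is_draining(still_stuck, restart_time=60.0))
```

If the check comes back false after the window closes, that is the runbook's cue to move on to the next cause or wake the person it names, not to keep retrying.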
The result
Ten minutes, start to finish, alone, in my first week, at 3am. I did not wake anyone. I did not guess. I did not have to be a hero, because the people who built the system had already done the heroics in advance, calmly, in daylight, by writing down what they knew. The next morning I added one small clarification to the runbook, so the next person would have it even easier.