The Page That Never Came

I carried the pager for a week and it never once went off at night. I slept every night of my rotation. For anyone who has done on-call the old way, that sentence sounds like fiction.

What we did right

A year before, our rotation was a nightmare of noise. Dozens of alerts fired, most of them meaningless, and together they trained us to sleep through our phones. So we did the unglamorous work.

We deleted every alert nobody could explain. We rewrote the rest to fire on user-facing symptoms tied to our SLOs, not on internal twitches like a single pod restarting or a brief CPU spike that fixed itself. We tied urgency to the error budget: if a problem was not burning the budget fast enough to threaten the SLO, it could wait for morning, and it was allowed to wait for morning.

Every surviving alert had to answer one question. Is this worth waking a human? If not, it did not exist.
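
For readers who want the budget rule in concrete terms, here is a minimal sketch of how "burning the budget fast enough" might be checked. It is not our actual alerting config. The names (Window, burn_rate, decide), the 99.9% target, and the thresholds are all assumptions for illustration. The 14.4 figure is the burn rate at which a 30-day budget loses about 2% in one hour, a commonly cited starting point for a fast-burn page.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if window.total_requests == 0:
        return 0.0
    error_rate = window.failed_requests / window.total_requests
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    return error_rate / budget


def decide(long_window: Window, short_window: Window,
           slo_target: float = 0.999,       # assumed SLO
           page_threshold: float = 14.4,    # assumed fast-burn threshold
           ticket_threshold: float = 1.0) -> str:
    """Return "page", "ticket", or "log" for the current burn."""
    long_burn = burn_rate(long_window, slo_target)
    short_burn = burn_rate(short_window, slo_target)
    # Page only if the burn is both sustained (long window) and still
    # happening right now (short window); otherwise it can wait.
    if long_burn >= page_threshold and short_burn >= page_threshold:
        return "page"
    # Slower burn that would still exhaust the budget: a morning ticket.
    if long_burn >= ticket_threshold:
        return "ticket"
    return "log"
```

Requiring both windows to agree is what lets a brief spike that fixed itself stay out of the pager: by the time the long window notices, the short window has already recovered.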

The week

Things still happened. A node failed and was replaced automatically. A dependency got briefly slow and recovered inside its budget. A batch job ran long. The system absorbed all of it without needing me, and logged it neatly for the morning review.

The one ticket that did need attention arrived politely at 10am, well inside business hours, with enough budget left that nobody had to rush.

The result

A full rotation, no night pages, and a team that once again believes its pager. The page that never came was not an accident. It was the direct result of deciding, on purpose, that our sleep was worth protecting and that an alert nobody acts on is worse than no alert at all.
