The Alert That Saved the Weekend

It was 3pm on a Friday when the alert fired. Not a page. A low-urgency notification in the team channel, exactly as designed: “checkout-svc memory trending toward limit, ~6 hours of runway.”

What we did right

Months earlier we had stopped alerting only on the cliff. A pod hitting its memory limit and getting killed is a cliff, and by then it is already too late to do anything graceful. Instead we added an alert on the slope: if memory was climbing steadily and the projection crossed the limit within a working day, tell us now, gently, while there was still time to think.

It felt almost too quiet when we set it up. No drama. Just a steady measurement of how much runway we had left.
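
For readers who want to see the idea in code, here is a minimal sketch of a slope alert of the kind described above. It is not our actual alert rule, which this post does not reproduce. The function names, the 8-hour horizon standing in for "a working day", and the choice of a least-squares straight-line fit are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Assumption: "a working day" is treated as 8 hours.
WORKING_DAY_HOURS = 8.0


def hours_of_runway(
    samples: Sequence[Tuple[float, float]], limit: float
) -> Optional[float]:
    """Estimate hours until memory reaches `limit`.

    `samples` is a sequence of (time_in_hours, memory_used) pairs, oldest
    first. A least-squares line is fitted through them. Returns None when
    there is too little data or memory is not climbing.
    """
    n = len(samples)
    if n < 2:
        return None

    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no slope to fit

    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking: no cliff on this trend

    intercept = mean_m - slope * mean_t
    last_t = samples[-1][0]
    fitted_now = intercept + slope * last_t
    # Time from the latest sample until the fitted line crosses the limit.
    return max(0.0, (limit - fitted_now) / slope)


def should_notify(
    samples: Sequence[Tuple[float, float]],
    limit: float,
    horizon_hours: float = WORKING_DAY_HOURS,
) -> bool:
    """True when the projected crossing falls inside the horizon."""
    runway = hours_of_runway(samples, limit)
    return runway is not None and runway <= horizon_hours
```

With made-up samples of 400, 500, and 600 MiB taken one hour apart and a 1200 MiB limit, the fitted slope is 100 MiB per hour, so hours_of_runway returns 6.0 and should_notify returns True. A real rule would likely also require the trend to hold over a minimum window so that a short burst does not trigger it.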

The moment

The Friday alert pointed at a slow leak introduced in that morning’s deploy. Left alone, the projection put the first pod death at roughly 1am Saturday, with the rest of the fleet following through the night as traffic shifted onto the survivors. A classic weekend-ruiner.

Instead, three of us looked at the heap profile over coffee, found the unbounded cache, shipped a one-line fix by 4pm, and watched the memory line flatten into a calm horizontal road.

The result

Nobody got paged. Nobody lost a Saturday. The fix went out in daylight, reviewed and unhurried, instead of at 2am by someone half awake. The whole thing was a non-event, which is exactly what made it a triumph. We celebrated the outage that never happened, because the alert that saved the weekend was the one that fired while the weekend was still savable.
