The Canary That Caught It

“Let one percent of users find the bug, automatically, so the other ninety-nine never do.”

The deploy went out at 11am on a Tuesday and it was broken. A serialization change that passed every test still mishandled a real-world payload shape we did not have in our fixtures. On a less careful day, this is a full outage.

It lasted about ninety seconds, for one percent of traffic, and then it healed itself.

What we did right

We had long since stopped shipping to everyone at once. Every deploy goes to a canary first: a small slice of real production traffic, usually one percent, running the new version beside the stable one.

The rollout controller watches the canary against the baseline on the signals that matter, error rate and latency. Not absolute thresholds, which are always wrong, but a comparison. If the new version is meaningfully worse than the old one on live traffic, the rollout stops and reverses itself. No human required to make the call in the moment.

The moment

The instant the canary took real traffic, its error rate jumped while the baseline stayed flat. The controller noticed the divergence within a minute, marked the canary unhealthy, halted the rollout, and shifted the one percent back to the stable version.

By the time the notification reached our channel, it was already past tense: โ€œcanary for order-svc rolled back, error rate regression versus baseline.โ€ A report, not an emergency.

The result

One percent of users saw errors for about a minute. The other 99 percent never knew a thing. We fixed the payload handling, added the missing case to our fixtures, and shipped again that afternoon, this time green through the canary and out to everyone. The canary caught what our tests could not, because the only complete test environment is production, and the safe way to test there is a little at a time.

← Back to the light Share your story