Tail Sampling Caught the One in Two Thousand

Somewhere in our traffic, roughly one order in two thousand was coming out wrong. Rare enough to never dent an aggregate graph. Common enough to matter. The kind of bug that, on a less careful system, hides for months.

What we did right

We had been down the cheap road before and learned from it. Uniform head sampling, which decides at the start of each request whether to keep its trace, optimizes for exactly the requests you do not need: the boring, successful, average ones. At a low sample rate, a rare failure is almost never among the traces that survive.
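
To put rough numbers on that, here is a small sketch of the arithmetic. The one-in-two-thousand failure rate is the one from this story; the 1% head sample rate and the 100,000-request volume are illustrative assumptions, not our actual configuration, and the function names are made up for the example.

```python
def expected_failures_kept(total_requests: int, failure_rate: float, sample_rate: float) -> float:
    """Expected number of failing traces that survive uniform head sampling."""
    return total_requests * failure_rate * sample_rate


def prob_all_failures_missed(failures: int, sample_rate: float) -> float:
    """Chance that none of `failures` independent failing requests gets sampled."""
    return (1.0 - sample_rate) ** failures


# Assumed, illustrative inputs (not measured values).
TOTAL_REQUESTS = 100_000
FAILURE_RATE = 1 / 2000   # from the story: about one bad order in two thousand
SAMPLE_RATE = 0.01        # hypothetical head sample rate

if __name__ == "__main__":
    failures = round(TOTAL_REQUESTS * FAILURE_RATE)  # about 50 failing requests
    print("expected failing traces kept:",
          expected_failures_kept(TOTAL_REQUESTS, FAILURE_RATE, SAMPLE_RATE))
    print("probability every failure was dropped:",
          prob_all_failures_missed(failures, SAMPLE_RATE))
```

Under those assumptions the expected number of failing traces kept is about half a trace, and the odds that every one of the roughly fifty failures was dropped are better than even. The sampling decision is made before anyone knows the request will fail, so the failure gets no special treatment.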

So we ran tail-based sampling. Every trace is buffered until the request finishes, and only then do we decide. Errors are always kept. Slow requests above a latency threshold are always kept. The ordinary successes are sampled down to keep the bill sane. The decision waits until we actually know whether the request is interesting.
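
The policy above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our production sampler: the `FinishedTrace` type, the `keep_trace` function, the 1000 ms latency threshold, and the 1% baseline keep rate are all hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class FinishedTrace:
    """A trace that has been buffered until its request completed (hypothetical type)."""
    trace_id: str
    has_error: bool
    duration_ms: float


# Assumed, illustrative settings.
LATENCY_THRESHOLD_MS = 1000.0  # traces slower than this are always kept
BASELINE_KEEP_RATE = 0.01      # fraction of ordinary successes to keep


def keep_trace(trace: FinishedTrace, rng: Callable[[], float] = random.random) -> bool:
    """Decide whether to keep a trace, after the request has finished."""
    if trace.has_error:
        return True  # errors are always kept
    if trace.duration_ms > LATENCY_THRESHOLD_MS:
        return True  # slow requests are always kept
    # Ordinary successes are sampled down to control cost.
    return rng() < BASELINE_KEEP_RATE


if __name__ == "__main__":
    print(keep_trace(FinishedTrace("t1", has_error=True, duration_ms=12.0)))     # kept: error
    print(keep_trace(FinishedTrace("t2", has_error=False, duration_ms=2500.0)))  # kept: slow
```

The cost knob (`BASELINE_KEEP_RATE` here) only ever applies to the last branch. Turning it down makes the bill smaller without touching the error or latency rules.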

The moment

The corruption reports came in. We opened the trace explorer and filtered to errors on the order path. They were all there, every single failing request, because failing requests are never sampled away in our setup.

Within a few minutes the pattern was obvious: the bug only appeared when a particular promo code combined with a particular currency, sending the request down a rarely used branch that mishandled rounding. We had the exact trace, the exact span, the exact inputs.

The result

A bug that statistical bad luck could have hidden for a quarter was understood before lunch and fixed the same day. We still sample the boring traffic and the bill is still reasonable. We just never let a cost knob quietly decide that the rare and the broken were the first things worth discarding.