The Label We Did Not Add
“The best incident review is the one for the incident you prevented.”
Someone opened a pull request that added a single label to a counter. The label was user_id, so we could break traffic down per user. It looked harmless, the way these things always do.
This is the part of the story where, on a less careful team, the metrics backend falls over at 2am and takes monitoring for the whole company with it. That is not what happened here.
What we did right
We had read enough horror stories to take cardinality seriously. So the guardrail lived exactly where the mistake gets made, in the pull request itself.
A small metrics linter ran in CI. It knew that a counter produces one time series for each unique combination of label values, and it knew that certain label names (user_id, request_id, email, and anything that looks like a full URL) are unbounded by nature. It saw the new label and failed the build with a clear message: this label is high cardinality and will create one series per user.
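A check like this does not need to be sophisticated. The following is a minimal sketch of the idea, not our actual linter: the denylist contents beyond the names mentioned above, the URL heuristic, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical denylist: label names whose values are unbounded by nature.
UNBOUNDED_LABELS = {"user_id", "request_id", "email"}

# Hypothetical heuristic for label names that carry a full URL.
URL_LIKE = re.compile(r"(^|_)(url|uri)$")


def find_high_cardinality_labels(labels):
    """Return the label names that are likely to be unbounded."""
    return [
        name for name in labels
        if name in UNBOUNDED_LABELS or URL_LIKE.search(name)
    ]


def lint_metric(metric_name, labels):
    """Return a list of error messages; an empty list means the metric passes."""
    return [
        f"{metric_name}: label '{name}' is high cardinality and will create "
        f"one series per distinct value; use logs for this breakdown"
        for name in find_high_cardinality_labels(labels)
    ]
```

In CI, a wrapper would call lint_metric for every metric declared in the diff and fail the build if any call returns a non-empty list. Matching on label names is deliberately crude: it cannot prove a label is unbounded, but it catches the common offenders at the point where they are introduced.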
The check was not a bureaucratic wall. It linked to a short internal note explaining the risk and pointing to the right tool for per-user breakdowns, which was logs, not metrics.
The moment
The author read the message, understood immediately, and moved the per-user breakdown into a log query where high cardinality is free and expected. The reviewer added a friendly one-liner confirming the reasoning. The whole exchange took five minutes and lived entirely in the pull request.
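The shape of that change can be sketched roughly as follows. This is an illustration under assumptions, not the code from the pull request: the Counter object stands in for a real metrics client, and the function and field names are hypothetical.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("requests")

# Stand-in for a real metrics client: one entry per label combination,
# which mirrors one time series per label combination.
request_counts = Counter()


def record_request(method, status, user_id):
    # Metric: only bounded labels, so the number of series stays small.
    request_counts[(method, status)] += 1
    # Log: user_id lives here, where one distinct value per user is expected.
    logger.info(json.dumps({
        "event": "request",
        "method": method,
        "status": status,
        "user_id": user_id,
    }))
```

However many users call record_request, the counter only grows with the number of distinct (method, status) pairs. The per-user breakdown is then a query over the structured log lines rather than a new label.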
The result
There were eight million users. Because the label would have created one series per user for every existing label combination on that counter, we came very close to an explosion of at least eight million new series and a company-wide monitoring outage. Instead there was a green build, a small improvement to a dashboard, and an engineer who now understands cardinality a little better. The label we did not add is one of our favorite incidents, precisely because it never became one.