Borrow the calm
The practices behind the calm
The good habits behind the stories, explained plainly. Each one is a way to make observability work for you instead of against you. Here is how they fit together.
- What is observability?
Observability is the ability to understand what a system is doing internally based only on the signals it emits from the outside: its metrics, logs, and traces. A system is observable when you can answer new questions about its behavior without shipping new code to ask them.
- What are the four golden signals?
The four golden signals are latency, traffic, errors, and saturation. Watching these four for any user-facing service tells you most of what you need to know about its health, which is why a trustworthy dashboard leads with them and little else.
See it in action: The Dashboard That Told the Truth
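As a rough illustration of the four golden signals, the sketch below summarizes one window of requests. The `Request` record, the `golden_signals` function, and the choice of in-flight requests divided by capacity as the saturation measure are all assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False when the request failed


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one window of requests as the four golden signals.

    capacity must be positive; saturation here is a hypothetical
    "in-flight requests / maximum concurrent requests" ratio.
    """
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    if n:
        # Nearest-rank 99th percentile, using integer ceiling of 0.99 * n.
        rank = (99 * n + 99) // 100
        p99 = durations[rank - 1]
    else:
        p99 = 0.0
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": p99,
        "traffic_rps": n / window_seconds,
        "error_rate": errors / n if n else 0.0,
        "saturation": in_flight / capacity,
    }
```

A dashboard built this way would put these four numbers at the top of the page for each user-facing service.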
- What is an SLO and an error budget?
A Service Level Objective (SLO) is a reliability target, such as 99.9 percent of requests succeeding over 30 days. The error budget is the amount of failure that target allows: here, the remaining 0.1 percent of requests. Tying alert urgency to the rate at which that budget is being spent means you get woken only when reliability is genuinely at risk.
See it in action: The Page That Never Came
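The error budget arithmetic can be sketched in a few lines. The function names and the paging threshold are assumptions for illustration: a burn rate of 14.4 is used because, sustained for one hour, it spends 2 percent of a 30-day budget (14.4 × 1 h / 720 h = 0.02). Your own threshold may reasonably differ.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Request-based error budget for one SLO window.

    slo_target is a fraction below 1.0, e.g. 0.999 for 99.9 percent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf")
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,        # 1.0 means the budget is gone
        "budget_remaining": 1.0 - consumed,
    }


def burn_rate(slo_target, window_error_rate):
    """How many times faster than 'exactly on budget' failures are arriving."""
    return window_error_rate / (1.0 - slo_target)


def should_page(slo_target, window_error_rate, threshold=14.4):
    """Page only when the budget is burning fast enough to matter.

    The default threshold is a hypothetical choice, not a standard.
    """
    return burn_rate(slo_target, window_error_rate) >= threshold
```

For a 99.9 percent target and one million requests, the budget is about 1,000 failed requests. A 2 percent error rate burns it roughly 20 times too fast and would page under this threshold; a 0.2 percent error rate burns it about twice too fast and would not.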
- What is symptom-based alerting?
Symptom-based alerting fires on what users actually feel, such as high latency or error rate, rather than on internal causes like a single restart or CPU spike. It produces fewer, more meaningful pages, because every alert maps to real user pain worth waking someone for.
See it in action: The Page That Never Came
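One way to picture symptom-based alerting is a rule that pages only on what users feel and merely records internal causes for context. The `Snapshot` fields, the thresholds, and the "page" / "record" / "none" actions below are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float       # symptom: fraction of user requests failing
    p99_latency_ms: float   # symptom: latency users experience
    pod_restarts: int       # cause: internal event, not user pain by itself
    cpu_utilization: float  # cause: 0.0 to 1.0


def alert_action(s, max_error_rate=0.01, max_p99_ms=500.0):
    """Return (action, reasons). Only symptoms can produce a page."""
    symptoms = []
    if s.error_rate > max_error_rate:
        symptoms.append("error_rate")
    if s.p99_latency_ms > max_p99_ms:
        symptoms.append("latency")
    if symptoms:
        return "page", symptoms

    # Causes without symptoms are kept as context, never paged on.
    causes = []
    if s.pod_restarts > 0:
        causes.append("restart")
    if s.cpu_utilization > 0.9:
        causes.append("cpu")
    return ("record" if causes else "none"), causes
```

With these assumed thresholds, a restart plus a CPU spike while users see normal latency and errors yields `("record", ["restart", "cpu"])`, so nobody is woken.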
- Why should a dashboard show data freshness?
A metric that has stopped updating can look identical to a healthy flat line. A freshness badge showing how old each panel's data is, plus an alert when a metric stops arriving, lets you tell a calm system apart from a dead pipeline at a glance.
See it in action: The Dashboard That Told the Truth
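A freshness check can be as small as comparing the newest sample's timestamp to the current time. This is a minimal sketch: the function names, the badge wording, and the 120-second staleness limit are assumptions, and timestamps are taken to be Unix seconds.

```python
def freshness(last_sample_ts, now, stale_after_s=120.0):
    """Describe how old a panel's newest data point is."""
    age = now - last_sample_ts
    return {
        "age_s": age,
        "badge": f"updated {int(age)}s ago",
        "stale": age > stale_after_s,
    }


def stale_metrics(last_seen, now, stale_after_s=120.0):
    """Names of metrics whose newest sample is older than the limit.

    last_seen maps a metric name to the timestamp of its newest sample.
    """
    return sorted(
        name for name, ts in last_seen.items() if now - ts > stale_after_s
    )
```

In practice `now` would come from something like `time.time()`. It is passed in explicitly here so the check is easy to test, and a non-empty result from `stale_metrics` is what would drive the "metric stopped arriving" alert.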
- What is the difference between head and tail sampling?
Sampling keeps only a fraction of traces to control cost. Head sampling decides at the start of a request, before you know whether it will fail, so most rare errors are discarded along with everything else outside the sample. Tail sampling decides after the request finishes, so you can always keep the slow and failed requests that matter most.
See it in action: Tail Sampling Caught the One in Two Thousand
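The difference between the two decisions shows up clearly in code: the head sampler only has a coin flip to work with, while the tail sampler can inspect the finished trace. The `Trace` record, the function names, and the one-second "slow" cutoff are assumptions for this sketch, not any tracing library's API.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: int
    duration_ms: float
    error: bool


def head_sample(rate, rng):
    """Decide at request start: the outcome is unknown, so flip a coin."""
    return rng.random() < rate


def tail_sample(trace, rate, rng, slow_ms=1000.0):
    """Decide after the request finishes.

    Failed and slow traces are always kept; the rest are sampled at `rate`.
    """
    if trace.error or trace.duration_ms >= slow_ms:
        return True
    return rng.random() < rate


def keep_with_tail_sampling(traces, rate, seed=0):
    """Apply tail sampling to a batch of finished traces."""
    rng = random.Random(seed)
    return [t for t in traces if tail_sample(t, rate, rng)]
```

The trade-off is that tail sampling has to buffer each trace until it completes, which costs memory that head sampling never needs.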
- How do you prevent a cardinality explosion?
Every unique combination of label values becomes its own time series, so an unbounded label like a user ID can create millions of series. A guardrail in code review, such as a linter that rejects high-cardinality labels, stops the explosion before it ships. Keep unbounded identifiers in logs and traces instead, where they do not multiply the number of series.
See it in action: The Label We Did Not Add
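The multiplication behind a cardinality explosion, and the kind of check a review-time linter might run, can be sketched as follows. The denylist contents and both function names are hypothetical; a real linter would read label names out of your metric definitions.

```python
from math import prod

# Hypothetical denylist of labels whose value sets grow without bound.
UNBOUNDED_LABELS = {"user_id", "request_id", "session_id", "trace_id", "email"}


def series_count(distinct_values):
    """Upper bound on time series for one metric.

    distinct_values maps each label name to its number of distinct values;
    every combination can become a separate series, so the counts multiply.
    """
    return prod(distinct_values.values())


def lint_labels(metric, labels):
    """Return one violation message per unbounded label on the metric."""
    return [
        f"{metric}: label '{label}' is unbounded; keep it in logs or traces"
        for label in labels
        if label in UNBOUNDED_LABELS
    ]
```

With 5 methods, 6 status codes, and 4 regions, a metric has at most 120 series. Adding a `user_id` label with a million distinct values raises that bound to 120 million, which is the case the linter exists to reject.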
- What is a canary release?
A canary release sends a new version to a small slice of real traffic first, running it beside the stable version. A controller compares their error rate and latency, and rolls back automatically if the new version is worse, so a bad deploy reaches only a fraction of users.
See it in action: The Canary That Caught It
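The comparison a canary controller makes can be sketched as a single decision function. Everything named here is an assumption for illustration: the `Stats` record, the "wait" / "rollback" / "promote" verdicts, and the thresholds (a minimum of 500 canary requests, at most 0.5 percentage points more errors, at most 20 percent higher p99 latency).

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(stable, canary, min_requests=500,
                   max_error_delta=0.005, max_latency_ratio=1.2):
    """Compare the canary to the stable version over the same window."""
    # Too little canary traffic to judge: keep observing.
    if canary.requests < min_requests:
        return "wait"
    # Canary fails noticeably more often than stable.
    if canary.error_rate - stable.error_rate > max_error_delta:
        return "rollback"
    # Canary is noticeably slower than stable.
    if canary.p99_latency_ms > stable.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Comparing against the stable version's numbers from the same window, rather than against fixed limits, keeps the check fair when both versions are affected by the same traffic pattern.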
- What makes a good runbook?
A good runbook is a short, specific checklist linked directly from the alert. It names the likely causes in order, the exact dashboard to open, the safe first action with the command to run, how to tell it worked, and who to escalate to. It turns a scary page into a calm, well-lit walk.
See it in action: The Runbook That Worked
- What is a blameless postmortem?
A blameless postmortem investigates how the system allowed a mistake, not who made it. By treating skilled people as allies rather than causes, it surfaces the real contributing factors, produces concrete fixes, and builds the psychological safety that makes people report risks early.
See it in action: The Blameless Postmortem That Changed Everything