SafetyWing Alert Runbooks
Operational runbooks for SafetyWing’s custom Prometheus alerts — the
component, environment, and platform tier rules we own (see
MONITORING.md for the tiering
model). Kubernetes control-plane / node-exporter alerts shipped by
kube-prometheus-stack are documented upstream at
runbooks.prometheus-operator.dev.
Every custom alert carries a runbook_url annotation that links here, and the
Alertmanager Slack message renders it as a 📖 runbook link.
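To check which runbook links the deployed rules actually carry, a small filter along these lines may help. It is a minimal sketch, not part of the existing tooling: the function name extract_runbook_urls is hypothetical, and the usage line assumes the rules are deployed as prometheus-operator PrometheusRule objects, which this page does not confirm.

```shell
# Hypothetical helper: reads rule YAML on stdin and prints each distinct
# runbook_url value, one per line. Quoted values keep their quotes.
extract_runbook_urls() {
  sed -n 's/^[[:space:]]*runbook_url:[[:space:]]*//p' | sort -u
}

# Possible usage against a live cluster (assumes the PrometheusRule CRD is installed):
#   kubectl get prometheusrules --all-namespaces -o yaml | extract_runbook_urls
```

The helper only reads stdin, so it can be tried on a saved rules file before pointing it at a cluster.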
How to use a runbook
Each page follows the same shape:
- Meaning — what the alert detects and the exact expression.
- Impact — what’s degraded for users / the system while it fires.
- Diagnosis — commands to confirm and localize the problem.
- Mitigation — how to stop the bleeding and fix the root cause.
Conventions
- <env> is the environment (staging, a hatchery slug, …); component infra lives in namespace safetywing-<env>-infra, apps in safetywing-<env>-applications.
- kubectl examples target the hetzner cluster unless noted. Switch context with kubectl config use-context hetzner.
- Alerts are labelled severity (page/ticket/info), team, tier (component/environment/platform), and (where applicable) environment.
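As a rough illustration of the namespace and context conventions above, the sketch below derives both namespace names from an environment name. The variable SW_ENV and the function check_env are hypothetical names chosen for this example, and staging is used only as a sample value.

```shell
# SW_ENV stands in for the <env> placeholder; "staging" is just a sample default.
SW_ENV="${SW_ENV:-staging}"

# Namespace names follow the safetywing-<env>-infra / -applications pattern.
INFRA_NS="safetywing-${SW_ENV}-infra"
APPS_NS="safetywing-${SW_ENV}-applications"

echo "infra namespace: ${INFRA_NS}"
echo "apps namespace:  ${APPS_NS}"

# Wrapped in a function so nothing touches a cluster until it is called.
check_env() {
  kubectl config use-context hetzner   # runbook examples target this cluster
  kubectl get pods -n "${INFRA_NS}"    # component infra workloads
  kubectl get pods -n "${APPS_NS}"     # application workloads
}
```

Calling check_env after sourcing the snippet switches to the hetzner context and lists pods in both namespaces for the chosen environment.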
Browse the catalog in the sidebar, grouped by component.