SafetyWing Alert Runbooks

Operational runbooks for SafetyWing’s custom Prometheus alerts: the component-, environment-, and platform-tier rules we own (see MONITORING.md for the tiering model). Kubernetes control-plane and node-exporter alerts shipped by kube-prometheus-stack are documented upstream at runbooks.prometheus-operator.dev.

Every custom alert carries a runbook_url annotation that links here, and the Alertmanager Slack message renders it as a 📖 runbook link.

How to use a runbook

Each page follows the same shape:

  • Meaning — what the alert detects and the exact expression.
  • Impact — what’s degraded for users / the system while it fires.
  • Diagnosis — commands to confirm and localize the problem.
  • Mitigation — how to stop the bleeding and fix the root cause.

Conventions

  • <env> is the environment (staging, a hatchery slug, …); component infra lives in namespace safetywing-<env>-infra, apps in safetywing-<env>-applications.
  • kubectl examples target the hetzner cluster unless noted. Switch context with kubectl config use-context hetzner.
  • Alerts carry the labels severity (page/ticket/info), team, tier (component/environment/platform), and, where applicable, environment.
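
As a minimal sketch of the namespace convention above, the two namespaces for an environment can be derived from its name. The helper names infra_ns and apps_ns are hypothetical and exist only for illustration; they are not part of any SafetyWing tooling. The kubectl lines are left commented out so the snippet has no side effects when sourced.

```shell
#!/usr/bin/env bash
# Sketch only: infra_ns and apps_ns are hypothetical helpers that apply the
# safetywing-<env>-infra / safetywing-<env>-applications naming convention.

# Print the infra namespace for the environment given as $1.
infra_ns() { printf 'safetywing-%s-infra\n' "$1"; }

# Print the applications namespace for the environment given as $1.
apps_ns() { printf 'safetywing-%s-applications\n' "$1"; }

# Example usage against the hetzner cluster (uncomment to run):
#   kubectl config use-context hetzner
#   kubectl get pods -n "$(infra_ns staging)"
#   kubectl get pods -n "$(apps_ns staging)"
```

Passing the environment as an argument keeps the same commands usable for staging or any other environment without retyping the full namespace.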

Browse the catalog in the sidebar, grouped by component.