Meaning#
The 5xx error ratio for a single environment exceeded its configured threshold. This is an environment-tier SLO alert emitted per environment by the safetywing-environment chart, scoped to that environment’s application services.
Fires when: the environment’s 5xx ratio crosses the threshold while traffic is above a minimum RPS floor (the minRps guard avoids alerting on noise at low traffic).
# hetzner / Traefik form
sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*",code=~"5.."}[5m]))
/ sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*"}[5m])) > <ratio>
and
sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*"}[5m])) > <minRps>
On GKE environments the ingress is nginx, so the source metric differs:
# GKE / nginx form (same shape, nginx metric)
sum(rate(nginx_ingress_controller_requests{namespace="safetywing-<env>-applications",status=~"5.."}[5m]))
/ sum(rate(nginx_ingress_controller_requests{namespace="safetywing-<env>-applications"}[5m])) > <ratio>
Alert settings: for: 5m, severity from chart values, tier environment. The alert carries an environment label identifying the affected env.
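To make the interplay between the ratio threshold and the minRps guard concrete, here is a minimal Python sketch of the Traefik-form condition shown above. It is an illustration only, not the chart's implementation: the Window type, the condition_holds function, and the threshold values 0.05 and 1.0 are hypothetical, and the real thresholds come from the chart values. The alert additionally requires the condition to hold for the full for: 5m period, which this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request rates averaged over the 5m window, in requests per second."""
    total_rps: float  # all requests to the environment's app services
    error_rps: float  # requests that returned a 5xx status


def condition_holds(w: Window, ratio: float, min_rps: float) -> bool:
    """Mirror the Traefik-form expression: ratio > threshold AND traffic > floor."""
    if w.total_rps <= min_rps or w.total_rps <= 0:
        # At or below the floor the guard suppresses the alert, whatever the ratio.
        return False
    return (w.error_rps / w.total_rps) > ratio


# Hypothetical thresholds; the real values come from the chart values.
RATIO, MIN_RPS = 0.05, 1.0

print(condition_holds(Window(total_rps=0.2, error_rps=0.1), RATIO, MIN_RPS))   # quiet env, 50% errors
print(condition_holds(Window(total_rps=40.0, error_rps=4.0), RATIO, MIN_RPS))  # busy env, 10% errors
print(condition_holds(Window(total_rps=40.0, error_rps=1.0), RATIO, MIN_RPS))  # busy env, 2.5% errors
```

The first case shows why the guard exists: a quiet environment with a 50% error ratio stays silent because its traffic is below the floor, while the busy environment at 10% errors satisfies both clauses.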
Impact#
Users of this specific environment are seeing server errors above the acceptable rate. Scope is one environment — a bad deploy or a sick upstream dependency in that env is the usual cause. Other environments are unaffected unless they share infrastructure.
Diagnosis#
Read the environment label off the alert, then find which app service emits the 5xx — break it down by (service):
# hetzner: per-service 5xx rate within the environment
topk(10, sum by (service) (
rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*",code=~"5.."}[5m])
))
# GKE: per-ingress 5xx rate within the environment
topk(10, sum by (ingress, service) (
rate(nginx_ingress_controller_requests{namespace="safetywing-<env>-applications",status=~"5.."}[5m])
))
Inspect the implicated app pods in the environment’s applications namespace:
# hetzner cluster (GKE: use the relevant gke context)
kubectl config use-context hetzner
kubectl get pods -n safetywing-<env>-applications -o wide
kubectl describe pod -n safetywing-<env>-applications <pod>
kubectl logs -n safetywing-<env>-applications <pod> --tail=200
kubectl logs -n safetywing-<env>-applications <pod> --previous --tail=200 # if it restarted
Check recent deploys / rollouts (a bad release is the most common cause):
kubectl rollout status deploy/<service> -n safetywing-<env>-applications
kubectl rollout history deploy/<service> -n safetywing-<env>-applications
kubectl get events -n safetywing-<env>-applications --sort-by=.lastTimestamp | tail -30
Check upstream dependencies for that environment (MySQL/RabbitMQ/Kafka):
kubectl get pods -n safetywing-<env>-infra -o wide
kubectl logs -n safetywing-<env>-infra <db-or-broker-pod> --tail=100
Mitigation#
- Localize: which service dominates the 5xx (topk by (service) above)? Usually a single app.
- If a deploy lines up with the onset, roll it back: kubectl rollout undo deploy/<service> -n safetywing-<env>-applications.
- Read the failing service’s logs for the error class: app exception (500) vs upstream unreachable (502/503/504).
- If upstream-driven, check safetywing-<env>-infra (MySQL/MOCO, RabbitMQ, Kafka) for the failing dependency and restore it; the app 5xx should clear once the dependency recovers.
- If capacity-driven (5xx rises with load), scale the affected deployment.
- Confirm the alert is not just crossing the minRps floor on a low-traffic env; if RPS is barely above the floor and errors are a handful, it may be noise.
- Root causes: bad env deploy, app exceptions, upstream dependency outage in *-infra, or capacity exhaustion.
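The error-class step above (500 versus 502/503/504) can be sketched as a small Python helper that tallies status codes and reports which class dominates. This is a hedged illustration, not part of the chart or any tooling referenced here: the function names error_class and dominant_class and the sample list are hypothetical, and extracting status codes from a real service's logs is assumed to have happened already.

```python
from collections import Counter


def error_class(status: int) -> str:
    """Map a status code to the error classes named in the mitigation steps."""
    if status == 500:
        return "app exception"
    if status in (502, 503, 504):
        return "upstream unreachable"
    if 500 <= status <= 599:
        return "other 5xx"
    return "not a 5xx"


def dominant_class(statuses):
    """Return the most frequent 5xx error class, or None if there are no 5xx."""
    counts = Counter(error_class(s) for s in statuses if 500 <= s <= 599)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical sample of status codes pulled from a failing service's logs.
sample = [200, 502, 503, 500, 502, 504, 200]
print(dominant_class(sample))  # upstream unreachable
```

A result of "upstream unreachable" points toward the safetywing-<env>-infra checks, while "app exception" points toward the service's own logs and a possible rollback.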
References#
- safetywing-environment chart (infra-charts): https://github.com/safetywing/infra-charts
- Traefik metrics (Prometheus): https://doc.traefik.io/traefik/observability/metrics/prometheus/
- ingress-nginx Prometheus metrics: https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/