TraefikHigh5xxRate • SafetyWing Runbooks

Meaning

The cluster-wide ratio of 5xx responses served by Traefik exceeds 5%. This is an edge-level signal aggregating all services behind Traefik.

Fires when: 5xx requests are more than 5% of all Traefik service requests.

sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
  / sum(rate(traefik_service_requests_total[5m])) > 0.05

The condition must hold for 10 minutes (for: 10m) before the alert fires. Labels: severity ticket, tier platform. The alert is cluster-wide and carries no environment label.
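As a worked example of the threshold, the sketch below applies the same comparison to two request rates given in requests per second. The function name breaches_threshold and the sample numbers in the usage note are illustrative assumptions; Prometheus evaluates the real expression, and the 10-minute hold is not modeled here.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the comparison in the alert expression,
#   5xx rate / total rate > 0.05
# Arguments: $1 = 5xx requests per second, $2 = total requests per second.
# Returns exit status 0 when the ratio is above 5%, 1 otherwise.
breaches_threshold() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t <= 0) exit 1            # no traffic: the alert cannot fire
    if (e / t > 0.05) exit 0      # above 5%: condition is true
    exit 1
  }'
}
```

For example, breaches_threshold 12 200 succeeds (12 / 200 = 6%), while breaches_threshold 8 200 does not (4%).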

Impact

A meaningful fraction of requests entering through the edge are failing with server errors. Because the signal is cluster-wide, it usually points at either a broadly impacting backend (a shared dependency) or one high-traffic service skewing the aggregate. Users across one or more environments see errors.

Diagnosis

Find which services are emitting the 5xx responses by breaking the aggregate down by service:

# Top 5xx contributors by absolute rate
topk(10, sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[5m])))

# Per-service 5xx ratio (find the worst offenders)
sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[5m]))
  / sum by (service) (rate(traefik_service_requests_total[5m]))

# Which 5xx codes (502/503/504 vs 500)
sum by (code) (rate(traefik_service_requests_total{code=~"5.."}[5m]))
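To run these breakdowns from a terminal instead of the Prometheus UI, one option is the instant-query endpoint of the Prometheus HTTP API (/api/v1/query). This is a minimal sketch: the PROM_URL variable, its default of http://localhost:9090 (for example through a port-forward), and the prom_query name are assumptions to adapt to this cluster.

```shell
#!/bin/sh
# Assumption: Prometheus is reachable at PROM_URL, e.g. through a port-forward.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: run an instant PromQL query and print the raw JSON response.
# -G sends the data as a GET query string; --data-urlencode escapes the
# quotes, braces, and brackets that PromQL expressions contain.
prom_query() {
  curl -sS -G --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}

# Usage (not executed here):
#   prom_query 'sum by (code) (rate(traefik_service_requests_total{code=~"5.."}[5m]))'
```

The response is JSON with the series under data.result; piping it through a JSON formatter makes the per-service or per-code values easier to read.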

502/503/504 typically mean the backend is unreachable/unhealthy (Traefik can’t get a good response); 500 is usually the app itself erroring.
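The rule of thumb above can be written down as a small lookup, which is handy when annotating the per-code query output. The function name classify_5xx and the wording of its labels are illustrative assumptions, and the mapping is a triage hint, not a guarantee.

```shell
#!/bin/sh
# Hypothetical triage helper: maps a status code to the likely fault
# domain, following the 502/503/504 versus 500 distinction.
classify_5xx() {
  case "$1" in
    502|503|504) echo "backend unreachable or unhealthy" ;;
    500)         echo "application error" ;;
    5[0-9][0-9]) echo "other server error" ;;
    *)           echo "not a 5xx code" ;;
  esac
}
```

For example, classify_5xx 503 prints "backend unreachable or unhealthy", which points the investigation at pod readiness and recent rollouts before application logs.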

Inspect Traefik and the implicated backend:

kubectl config use-context hetzner
kubectl get pods -n traefik -o wide
kubectl logs -n traefik <traefik-pod> --tail=200 | grep -Ei '5[0-9][0-9]|error|backend'

# The service label maps to a namespace/service; inspect the backend pods
kubectl get pods -n <namespace> -o wide
kubectl describe pod -n <namespace> <pod>
kubectl logs -n <namespace> <pod> --tail=200
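When a namespace has many pods, the listing can be narrowed to the ones worth describing. The filter below is a sketch that reads kubectl get pods output (without headers) from standard input; the name not_ready_pods is an assumption, and it relies only on the first three columns (NAME, READY, STATUS). Pods of finished Jobs (status Completed) are also listed, so read the result with that in mind.

```shell
#!/bin/sh
# Hypothetical filter: prints NAME, READY, and STATUS for pods that are
# not fully ready or not in the Running status.
not_ready_pods() {
  awk '{
    split($2, r, "/")   # READY column looks like "1/2"
    if (r[1] != r[2] || $3 != "Running") { print $1, $2, $3 }
  }'
}

# Usage (not executed here):
#   kubectl get pods -n <namespace> --no-headers | not_ready_pods
```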

Mitigation

Identify whether one service dominates (the topk query by service under Diagnosis) or the errors are broad. A single dominant service localizes the fix; broad 5xx errors point at a shared dependency.
For 502/503/504: check that backend pods are Ready and not OOMKilled or in CrashLoopBackOff; check recent rollouts (kubectl rollout status deploy/<name> -n <ns> and kubectl rollout history deploy/<name> -n <ns>) and roll back a bad deploy (kubectl rollout undo deploy/<name> -n <ns>).
For 500: read the app logs for the failing service; the error is in the application, not the edge.
If broad, check shared dependencies used by many services (databases, message brokers, auth) and the health of the cluster control plane and nodes.
Verify it is not a traffic spike overwhelming undersized backends (compare the 5xx rate against the total request rate); scale out if so.
Root causes: bad deploy of a high-traffic service, backend OOM/crashloop, shared dependency outage, or capacity exhaustion under load.
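For the bad-deploy case, the rollout check and rollback mentioned in the steps above can be laid out as a short sequence. The sketch below only prints the kubectl commands for review and runs nothing; the function name rollback_plan and the namespace and deployment arguments are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical helper: prints the rollback sequence for one deployment.
# Arguments: $1 = namespace, $2 = deployment name.
rollback_plan() {
  ns="$1"
  name="$2"
  # 1. Review revisions to confirm a recent rollout lines up with the 5xx onset.
  echo "kubectl rollout history deployment/${name} -n ${ns}"
  # 2. Roll back to the previous revision.
  echo "kubectl rollout undo deployment/${name} -n ${ns}"
  # 3. Wait for the rolled-back pods to become ready.
  echo "kubectl rollout status deployment/${name} -n ${ns}"
}
```

Printing instead of executing keeps the decision with the responder: review the history output first, and run the undo only if the revision timing matches the start of the errors.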

References

Traefik metrics (Prometheus): https://doc.traefik.io/traefik/observability/metrics/prometheus/
Traefik routing / services: https://doc.traefik.io/traefik/routing/services/
HTTP 5xx status reference: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses