Meaning#
The cluster-wide ratio of 5xx responses served by Traefik exceeds 5%. This is an edge-level signal aggregating all services behind Traefik.
Fires when: 5xx requests are more than 5% of all Traefik service requests.
sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
/ sum(rate(traefik_service_requests_total[5m])) > 0.05for: 10m, severity ticket, tier platform (cluster-wide, no environment label).
Impact#
A meaningful fraction of requests entering through the edge are failing with server errors. Because this is cluster-wide, it usually points at either a broadly-impacting backend (a shared dependency) or one high-traffic service skewing the aggregate. Users across one or more environments see errors.
Diagnosis#
Find which service(s) are emitting the 5xx — break the aggregate down by (service):
# Top 5xx contributors by absolute rate
topk(10, sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[5m])))
# Per-service 5xx ratio (find the worst offenders)
sum by (service) (rate(traefik_service_requests_total{code=~"5.."}[5m]))
/ sum by (service) (rate(traefik_service_requests_total[5m]))
# Which 5xx codes (502/503/504 vs 500)
sum by (code) (rate(traefik_service_requests_total{code=~"5.."}[5m]))502/503/504 typically mean the backend is unreachable/unhealthy (Traefik can’t get a good response); 500 is usually the app itself erroring.
Inspect Traefik and the implicated backend:
kubectl config use-context hetzner
kubectl get pods -n traefik -o wide
kubectl logs -n traefik <traefik-pod> --tail=200 | grep -Ei '5[0-9][0-9]|error|backend'
# The service label maps to a namespace/service — inspect the backend pods
kubectl get pods -n <namespace> -o wide
kubectl describe pod -n <namespace> <pod>
kubectl logs -n <namespace> <pod> --tail=200Mitigation#
- Identify whether one service dominates (
topk by (service)) or it is broad. A single dominant service localizes the fix; broad 5xx points at a shared dependency. - For 502/503/504: check that backend pods are Ready and not OOMKilled/CrashLooping; check recent rollouts (
kubectl rollout status/history deploy/<name> -n <ns>) and roll back a bad deploy. - For 500: read the app logs for the failing service; the error is in the application, not the edge.
- If broad, check shared deps used by many services (databases, message brokers, auth) and the cluster control plane / node health.
- Verify it is not a traffic spike overwhelming undersized backends (compare 5xx rate against total request rate) — scale out if so.
- Root causes: bad deploy of a high-traffic service, backend OOM/crashloop, shared dependency outage, or capacity exhaustion under load.
References#
- Traefik metrics (Prometheus): https://doc.traefik.io/traefik/observability/metrics/prometheus/
- Traefik routing / services: https://doc.traefik.io/traefik/routing/services/
- HTTP 5xx status reference: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses