Meaning

The 5xx error ratio for a single environment exceeded its configured threshold. This is an environment-tier SLO alert emitted per environment by the safetywing-environment chart, scoped to that environment’s application services.

Fires when: the environment’s 5xx ratio crosses the threshold while traffic is above a minimum RPS floor (the minRps guard avoids alerting on noise at low traffic).

# hetzner / Traefik form
sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*",code=~"5.."}[5m]))
  / sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*"}[5m])) > <ratio>
and
sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*"}[5m])) > <minRps>

On GKE environments the ingress is nginx, so the source metric differs:

# GKE / nginx form (same shape, nginx metric)
sum(rate(nginx_ingress_controller_requests{namespace="safetywing-<env>-applications",status=~"5.."}[5m]))
  / sum(rate(nginx_ingress_controller_requests{namespace="safetywing-<env>-applications"}[5m])) > <ratio>
and
sum(rate(nginx_ingress_controller_requests{namespace="safetywing-<env>-applications"}[5m])) > <minRps>

The rule has for: 5m, takes its severity from the chart values, and belongs to the environment tier. It carries an environment label identifying the affected environment.
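
As a worked illustration of the firing condition, the sketch below evaluates the same two comparisons (error ratio above the threshold, total RPS above the floor) for a single instant. The function name should_fire and the sample values (a ratio threshold of 0.05 and a minRps of 10) are assumptions for illustration, not values taken from the chart, and the for: 5m hold time is not modeled.

```shell
# should_fire RATE_5XX RATE_TOTAL RATIO MIN_RPS
# Prints "fire" when the 5xx ratio exceeds RATIO and total RPS exceeds MIN_RPS,
# otherwise "quiet". All rates are requests per second.
should_fire() {
  awk -v e="$1" -v t="$2" -v r="$3" -v m="$4" 'BEGIN {
    if (t > m && t > 0 && e / t > r) print "fire"; else print "quiet"
  }'
}

# Hypothetical thresholds: ratio 0.05, minRps 10.
should_fire 3 40 0.05 10   # 7.5% errors at 40 rps: prints fire
should_fire 1 2 0.05 10    # 50% errors but only 2 rps: prints quiet
```

The second call shows why the guard exists: a single failing request on a nearly idle environment produces a high ratio but stays below the RPS floor.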

Impact

Users of this specific environment are seeing server errors above the acceptable rate. Scope is one environment; the usual cause is a bad deploy or an unhealthy upstream dependency in that environment. Other environments are unaffected unless they share infrastructure.

Diagnosis

Read the environment label off the alert, then find which app service is emitting the 5xx responses by breaking the rate down by service:

# hetzner: per-service 5xx rate within the environment
topk(10, sum by (service) (
  rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*",code=~"5.."}[5m])
))

# GKE: per-ingress 5xx rate within the environment
topk(10, sum by (ingress, service) (
  rate(nginx_ingress_controller_requests{namespace="safetywing-<env>-applications",status=~"5.."}[5m])
))
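
Once the dominant service is known, splitting the same rate by status code can hint at whether the errors are app exceptions (500) or upstream failures (502/503/504). The sketch below builds the hetzner/Traefik form of such a query in shell and shows, commented out, how it could be sent to the Prometheus HTTP query API. The helper name per_code_query, the ENV_NAME default, and the PROM_URL address are assumptions for illustration; on GKE the nginx metric and its status label would be used instead.

```shell
# Hypothetical defaults; substitute the environment from the alert label
# and the address of your Prometheus instance.
ENV_NAME="${ENV_NAME:-staging}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# per_code_query ENV: print a PromQL query for the 5xx rate per status code
# across the environment's application services (Traefik metric).
per_code_query() {
  printf 'sum by (code) (rate(traefik_service_requests_total{service=~"safetywing-%s-applications-.*",code=~"5.."}[5m]))\n' "$1"
}

# Print the query for review.
per_code_query "$ENV_NAME"

# To run it against Prometheus (assumes PROM_URL is reachable):
# curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(per_code_query "$ENV_NAME")"
```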

Inspect the implicated app pods in the environment’s applications namespace:

# hetzner cluster (GKE: use the relevant gke context)
kubectl config use-context hetzner
kubectl get pods -n safetywing-<env>-applications -o wide
kubectl describe pod -n safetywing-<env>-applications <pod>
kubectl logs -n safetywing-<env>-applications <pod> --tail=200
kubectl logs -n safetywing-<env>-applications <pod> --previous --tail=200   # if it restarted

Check recent deploys / rollouts (a bad release is the most common cause):

kubectl rollout status deploy/<service> -n safetywing-<env>-applications
kubectl rollout history deploy/<service> -n safetywing-<env>-applications
kubectl get events -n safetywing-<env>-applications --sort-by=.lastTimestamp | tail -30

Check upstream dependencies for that environment (MySQL/RabbitMQ/Kafka):

kubectl get pods -n safetywing-<env>-infra -o wide
kubectl logs -n safetywing-<env>-infra <db-or-broker-pod> --tail=100

Mitigation

  1. Localize: which service dominates the 5xx (topk by (service) above)? Usually a single app.
  2. If a deploy lines up with the onset, roll it back: kubectl rollout undo deploy/<service> -n safetywing-<env>-applications.
  3. Read the failing service’s logs for the error class — app exception (500) vs upstream unreachable (502/503/504).
  4. If upstream-driven, check safetywing-<env>-infra (MySQL/MOCO, RabbitMQ, Kafka) for the failing dependency and restore it; the app 5xx should clear once the dependency recovers.
  5. If capacity-driven (5xx rises with load), scale the affected deployment.
  6. Confirm the alert is not an artifact of a low-traffic env sitting just above the minRps floor: if RPS is barely above the floor and the errors amount to a handful of requests, it may be noise.
  7. Root causes: bad env deploy, app exceptions, upstream dependency outage in *-infra, or capacity exhaustion.
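
The split in steps 3 and 4 between app exceptions and upstream failures can be sketched as a small lookup. The function name next_step and the wording of the messages are assumptions for illustration; the "other 5xx" branch is an added catch-all, not something the steps above prescribe.

```shell
# next_step STATUS_CODE: map a status code to the direction the steps above suggest.
next_step() {
  case "$1" in
    500)         echo "app exception: read service logs, consider rollback" ;;
    502|503|504) echo "upstream unreachable: check safetywing-<env>-infra" ;;
    5??)         echo "other 5xx: read service logs" ;;   # assumed catch-all
    *)           echo "not a 5xx" ;;
  esac
}

next_step 500   # prints the app-exception hint
next_step 503   # prints the upstream hint
```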

References