TraefikDown • SafetyWing Runbooks

Meaning

Prometheus has no healthy Traefik scrape target. Either the Traefik edge ingress is down, or just its metrics endpoint/scraping is broken.

Fires when: there is no up == 1 series for any Traefik scrape job.

absent(up{job=~".*traefik.*"} == 1)

Rule settings: for: 10m, severity: page, tier: platform (cluster-wide, so there is no environment label).

Impact

Traefik is the edge ingress (a hostNetwork DaemonSet on the Hetzner cluster). If Traefik itself is down, all external HTTP(S) traffic into the cluster fails — every public domain behind it. If only metrics are broken, traffic may still flow but we are blind to edge health and 5xx alerting is degraded.

Diagnosis

First decide: is Traefik actually down, or just unscraped?

kubectl config use-context hetzner
kubectl get pods -n traefik -o wide              # DaemonSet pods, one per (eligible) node
kubectl get ds -n traefik
kubectl get svc,endpoints -n traefik
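The DaemonSet output above can be reduced to a single verdict by comparing how many pods the DaemonSet wants with how many are Ready. The following is a minimal sketch, not part of any existing tooling: ds_health is a hypothetical helper, and the usage comment assumes the DaemonSet is named traefik (the name used in the rollout restart command under Mitigation).

```shell
# Hypothetical helper: classify DaemonSet health from two counts.
#   $1 = .status.desiredNumberScheduled
#   $2 = .status.numberReady
ds_health() {
  desired=$1
  ready=$2
  if [ "$desired" -eq 0 ]; then
    echo "no-eligible-nodes"   # nothing is scheduled at all
  elif [ "$ready" -eq 0 ]; then
    echo "down"                # pods are wanted, none are Ready
  elif [ "$ready" -lt "$desired" ]; then
    echo "degraded"            # some nodes lack a Ready edge pod
  else
    echo "healthy"
  fi
}

# Usage against the cluster (assumes the DaemonSet is named "traefik"):
#   set -- $(kubectl get ds traefik -n traefik \
#     -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')
#   ds_health "$1" "$2"
```

A "healthy" verdict while the alert is firing points toward a scrape-only problem; "down" or "degraded" points toward the pods or nodes.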

If pods are missing/CrashLooping:

kubectl describe pod -n traefik <pod>
kubectl logs -n traefik <pod> --tail=200
kubectl logs -n traefik <pod> --previous --tail=200   # last crash

Confirm whether it is a scrape-only problem (ServiceMonitor / --metrics.prometheus):

kubectl get servicemonitor -A | grep -i traefik
# Verify the metrics flag is enabled in the Traefik args
kubectl get ds -n traefik -o yaml | grep -i metrics

Check the scrape target state in the Prometheus UI (prom-ep.hetzner.safetywing.dev → Status → Targets), then via PromQL:

up{job=~".*traefik.*"}                 # 0 or absent = problem
count(up{job=~".*traefik.*"} == 1)     # how many healthy targets remain (empty result, not 0, when none are healthy)
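The same query can be run from a shell through the Prometheus HTTP API (/api/v1/query). This is a rough sketch under two assumptions: that the Prometheus host above is reachable over HTTPS, and that it needs no extra authentication. count_series is a hypothetical helper that counts result entries crudely with grep; jq would be more robust where it is installed.

```shell
# Hypothetical helper: count the series in a Prometheus instant-query
# response read from stdin. Each vector element carries one "metric" key,
# so counting those keys approximates the series count.
count_series() {
  # "|| true" keeps a no-match grep (exit 1) from tripping set -e/pipefail.
  n=$({ grep -o '"metric":' || true; } | wc -l | tr -d ' ')
  echo "$n"
}

# Usage (assumes https and no auth on the Prometheus endpoint):
#   curl -sS -G 'https://prom-ep.hetzner.safetywing.dev/api/v1/query' \
#     --data-urlencode 'query=up{job=~".*traefik.*"} == 1' | count_series
```

A count of 0 here corresponds to an empty query result, which is the condition the alert expression tests with absent().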

Probe the edge from outside to distinguish a real outage from a metrics gap:

curl -sS -o /dev/null -w '%{http_code}\n' https://grafana.safetywing.dev
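The status code printed by the probe can be read mechanically. A small sketch, assuming Traefik is the first hop for that domain (no CDN or other proxy in front); classify_edge is a hypothetical helper. Note that a 5xx still means something at the edge answered, so it may be a backend problem rather than Traefik itself.

```shell
# Hypothetical helper: interpret the code printed by curl -w '%{http_code}'.
# curl prints 000 when no HTTP response was received at all
# (DNS failure, refused connection, TLS error, timeout).
classify_edge() {
  case "$1" in
    000)     echo "edge-unreachable" ;;
    5??)     echo "edge-responding-5xx" ;;
    [1-4]??) echo "edge-responding" ;;
    *)       echo "unknown" ;;
  esac
}

# Usage (--max-time keeps a dead edge from hanging the probe):
#   code=$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' https://grafana.safetywing.dev)
#   classify_edge "$code"
```

Only "edge-unreachable" suggests a real edge outage; the other outcomes are consistent with a metrics gap.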

Mitigation

If external traffic is failing (curl to a known domain fails), treat it as an edge outage: restart the DaemonSet with kubectl rollout restart ds/traefik -n traefik, then watch the pods become Ready.
If pods are CrashLooping, read the logs — common causes are a bad static/dynamic config, a failed TLS cert resolver, or a hostNetwork port conflict on the node.
If pods are Running and traffic flows but the target is down, the problem is scraping: confirm --metrics.prometheus is enabled, the metrics port is exposed, and the ServiceMonitor/endpoints match. Fix the ServiceMonitor/labels rather than the ingress.
If only some nodes lost their Traefik pod (hostNetwork DaemonSet), check node readiness and taints: a NotReady node takes its edge pod out of service, and a taint the DaemonSet does not tolerate keeps a pod from being scheduled there.
Root causes: Traefik crash from bad config, node failure removing DaemonSet pods, metrics flag/ServiceMonitor regression, port conflict on hostNetwork, or Prometheus-side scrape misconfig.
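For the edge-outage case, the restart and the wait for Ready pods can be wrapped in one step. This is a sketch only: run and restart_traefik are hypothetical helpers, and the 5m timeout is an arbitrary value chosen for illustration. It relies on kubectl rollout status, which blocks until the DaemonSet rollout completes or the timeout expires.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Restart the Traefik DaemonSet, then wait for the rollout to finish.
# The 5m timeout is an arbitrary choice for this sketch.
restart_traefik() {
  run kubectl rollout restart ds/traefik -n traefik &&
    run kubectl rollout status ds/traefik -n traefik --timeout=5m
}

# Usage:
#   DRY_RUN=1 restart_traefik   # preview the commands
#   restart_traefik             # run them against the current context
```

Previewing with DRY_RUN=1 first is a cheap way to confirm the target before touching the edge during an incident.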

References

Traefik metrics (Prometheus): https://doc.traefik.io/traefik/observability/metrics/prometheus/
Traefik on Kubernetes: https://doc.traefik.io/traefik/providers/kubernetes-ingress/
Prometheus absent() / up: https://prometheus.io/docs/prometheus/latest/querying/functions/#absent