Meaning#
Prometheus has no healthy Traefik scrape target. Either the Traefik edge ingress is down, or just its metrics endpoint/scraping is broken.
Fires when: there is no up == 1 series for any Traefik scrape job.
absent(up{job=~".*traefik.*"} == 1)for: 10m, severity page, tier platform (cluster-wide, no environment label).
Impact#
Traefik is the edge ingress (a hostNetwork DaemonSet on the Hetzner cluster). If Traefik itself is down, all external HTTP(S) traffic into the cluster fails — every public domain behind it. If only metrics are broken, traffic may still flow but we are blind to edge health and 5xx alerting is degraded.
Diagnosis#
First decide: is Traefik actually down, or just unscraped?
kubectl config use-context hetzner
kubectl get pods -n traefik -o wide # DaemonSet pods, one per (eligible) node
kubectl get ds -n traefik
kubectl get svc,endpoints -n traefikIf pods are missing/CrashLooping:
kubectl describe pod -n traefik <pod>
kubectl logs -n traefik <pod> --tail=200
kubectl logs -n traefik <pod> --previous --tail=200 # last crashConfirm whether it is a scrape-only problem (ServiceMonitor / --metrics.prometheus):
kubectl get servicemonitor -A | grep -i traefik
# Verify the metrics flag is enabled in the Traefik args
kubectl get ds -n traefik -o yaml | grep -i metricsCheck the scrape target state in Prometheus UI (prom-ep.hetzner.safetywing.dev → Status → Targets), then via PromQL:
up{job=~".*traefik.*"} # 0 or absent = problem
count(up{job=~".*traefik.*"} == 1) # how many healthy targets remainProbe the edge from outside to distinguish a real outage from a metrics gap:
curl -sS -o /dev/null -w '%{http_code}\n' https://grafana.safetywing.devMitigation#
- If external traffic is failing (curl to a known domain fails), treat as an edge outage — restart the DaemonSet:
kubectl rollout restart ds/traefik -n traefikand watch pods come Ready. - If pods are CrashLooping, read the logs — common causes are a bad static/dynamic config, a failed TLS cert resolver, or a hostNetwork port conflict on the node.
- If pods are Running and traffic flows but the target is down, the problem is scraping: confirm
--metrics.prometheusis enabled, the metrics port is exposed, and the ServiceMonitor/endpoints match. Fix the ServiceMonitor/labels rather than the ingress. - If only some nodes lost their Traefik pod (hostNetwork DaemonSet), check node readiness and taints — a NotReady node drops its edge pod.
- Root causes: Traefik crash from bad config, node failure removing DaemonSet pods, metrics flag/ServiceMonitor regression, port conflict on hostNetwork, or Prometheus-side scrape misconfig.
References#
- Traefik metrics (Prometheus): https://doc.traefik.io/traefik/observability/metrics/prometheus/
- Traefik on Kubernetes: https://doc.traefik.io/traefik/providers/kubernetes-ingress/
- Prometheus
absent()/up: https://prometheus.io/docs/prometheus/latest/querying/functions/#absent