<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Traefik on SafetyWing Runbooks</title><link>https://runbooks.safetywing.dev/runbooks/traefik/</link><description>Recent content in Traefik on SafetyWing Runbooks</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://runbooks.safetywing.dev/runbooks/traefik/index.xml" rel="self" type="application/rss+xml"/><item><title>TraefikDown</title><link>https://runbooks.safetywing.dev/runbooks/traefik/traefikdown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/traefik/traefikdown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Prometheus has no healthy Traefik scrape target. Either the Traefik edge ingress itself is down, or Traefik is still serving traffic but its metrics endpoint or the Prometheus scrape is broken.&lt;/p&gt;
&lt;p&gt;Fires when: there is no &lt;code&gt;up == 1&lt;/code&gt; series for any Traefik scrape job.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;absent&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;up{job&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;.*traefik.*&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt; (cluster-wide, no &lt;code&gt;environment&lt;/code&gt; label).&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Traefik is the edge ingress (a hostNetwork DaemonSet on the Hetzner cluster). If Traefik itself is down, all external HTTP(S) traffic into the cluster fails for every public domain behind it. If only metrics are broken, traffic may still flow, but we are blind to edge health and 5xx alerting is degraded because it depends on the same Traefik metrics.&lt;/p&gt;</description></item><item><title>TraefikHigh5xxRate</title><link>https://runbooks.safetywing.dev/runbooks/traefik/traefikhigh5xxrate/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/traefik/traefikhigh5xxrate/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The cluster-wide ratio of 5xx responses served by Traefik exceeds 5%. This is an edge-level signal aggregating all services behind Traefik.&lt;/p&gt;
&lt;p&gt;Fires when: 5xx responses exceed 5% of all Traefik service requests, measured as a 5-minute rate.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;traefik_service_requests_total{code&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;5..&lt;/span&gt;&amp;#34;}[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;traefik_service_requests_total[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0.05&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt; (cluster-wide, no &lt;code&gt;environment&lt;/code&gt; label).&lt;/p&gt;
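&lt;p&gt;Because the alert aggregates every service, a per-service breakdown is a natural first query. A minimal sketch, assuming the metric carries Traefik&amp;#39;s usual &lt;code&gt;service&lt;/code&gt; label; the limit of 5 is an arbitrary choice for illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;# Services producing the most 5xx responses per second over the same 5m window
topk(5, sum by (service) (rate(traefik_service_requests_total{code=~&amp;#34;5..&amp;#34;}[5m])))&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If one service dominates the result, it is likely skewing the cluster-wide ratio. If errors are spread across many services, a shared dependency is the more likely cause.&lt;/p&gt;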
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A meaningful fraction of requests entering through the edge are failing with server errors. Because this is cluster-wide, it usually points at either a broadly impacting backend (a shared dependency) or one high-traffic service skewing the aggregate. Users across one or more environments see errors.&lt;/p&gt;</description></item></channel></rss>