<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SafetyWing Runbooks</title><link>https://runbooks.safetywing.dev/</link><description>Recent content on SafetyWing Runbooks</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://runbooks.safetywing.dev/index.xml" rel="self" type="application/rss+xml"/><item><title>Alert Catalog</title><link>https://runbooks.safetywing.dev/runbooks/catalog/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/catalog/</guid><description>&lt;h1 id="alert-catalog"&gt;Alert Catalog&lt;a class="anchor" href="#alert-catalog"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Every alert evaluated across SafetyWing clusters — &lt;strong&gt;29 custom&lt;/strong&gt; (component / environment / platform tiers, owned by us) and &lt;strong&gt;133 stock&lt;/strong&gt; (kube-prometheus-stack defaults). Custom alerts link to the runbook on this site; stock alerts link to the upstream &lt;a href="https://runbooks.prometheus-operator.dev/"&gt;prometheus-operator runbooks&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote class='book-hint '&gt;
&lt;p&gt;Generated from the live hetzner rule set and the infra-charts/cluster-monitors sources. Stock alerts are identical across clusters; custom alerts are deployed only in the environments and clusters where their chart is enabled.&lt;/p&gt;
&lt;/blockquote&gt;&lt;h2 id="safetywing-custom-alerts"&gt;SafetyWing custom alerts&lt;a class="anchor" href="#safetywing-custom-alerts"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="kafka--component-tier"&gt;Kafka &lt;small&gt;(component tier)&lt;/small&gt;&lt;a class="anchor" href="#kafka--component-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;KafkaOfflinePartitions&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/kafka/kafkaofflinepartitions/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;KafkaNoActiveController&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/kafka/kafkanoactivecontroller/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;KafkaUnderReplicatedPartitions&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/kafka/kafkaunderreplicatedpartitions/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;KafkaConsumerGroupLagHigh&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/kafka/kafkaconsumergrouplaghigh/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="kafka-connect--component-tier"&gt;Kafka Connect &lt;small&gt;(component tier)&lt;/small&gt;&lt;a class="anchor" href="#kafka-connect--component-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;KafkaConnectFailedTasks&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectfailedtasks/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;KafkaConnectWorkersDown&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectworkersdown/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;KafkaConnectNoConnectors&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectnoconnectors/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="mysql--component-tier"&gt;MySQL &lt;small&gt;(component tier)&lt;/small&gt;&lt;a class="anchor" href="#mysql--component-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;MysqlInstanceDown&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/mysql/mysqlinstancedown/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;MysqlConnectionsSaturated&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/mysql/mysqlconnectionssaturated/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;MysqlReplicationLagHigh&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/mysql/mysqlreplicationlaghigh/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;MysqlDiskFillingUp&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/mysql/mysqldiskfillingup/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="rabbitmq--component-tier"&gt;RabbitMQ &lt;small&gt;(component tier)&lt;/small&gt;&lt;a class="anchor" href="#rabbitmq--component-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;RabbitmqNodeDown&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqnodedown/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;RabbitmqMemoryAlarm&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqmemoryalarm/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;RabbitmqDiskAlarm&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdiskalarm/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;RabbitmqQueueBacklog&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuebacklog/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;RabbitmqQueueNoConsumers&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuenoconsumers/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="ceph--platform-tier"&gt;Ceph &lt;small&gt;(platform tier)&lt;/small&gt;&lt;a class="anchor" href="#ceph--platform-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;CephHealthError&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/ceph/cephhealtherror/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;CephMonOutOfQuorum&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/ceph/cephmonoutofquorum/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;CephHealthWarning&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/ceph/cephhealthwarning/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;CephOSDDown&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/ceph/cephosddown/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;CephClusterNearFull&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/ceph/cephclusternearfull/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="elasticsearch--platform-tier"&gt;Elasticsearch &lt;small&gt;(platform tier)&lt;/small&gt;&lt;a class="anchor" href="#elasticsearch--platform-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;ElasticsearchClusterRed&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusterred/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;ElasticsearchClusterYellow&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusteryellow/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;ElasticsearchHeapHigh&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchheaphigh/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;ElasticsearchDiskWatermark&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchdiskwatermark/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="node--platform-tier"&gt;Node &lt;small&gt;(platform tier)&lt;/small&gt;&lt;a class="anchor" href="#node--platform-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;NodeFilesystemAlmostFull&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/node/nodefilesystemalmostfull/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="traefik--platform-tier"&gt;Traefik &lt;small&gt;(platform tier)&lt;/small&gt;&lt;a class="anchor" href="#traefik--platform-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;TraefikDown&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;page&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/traefik/traefikdown/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;TraefikHigh5xxRate&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/traefik/traefikhigh5xxrate/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="environment--environment-tier"&gt;Environment &lt;small&gt;(environment tier)&lt;/small&gt;&lt;a class="anchor" href="#environment--environment-tier"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Alert&lt;/th&gt;
 &lt;th&gt;Severity&lt;/th&gt;
 &lt;th&gt;Runbook&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;EnvironmentHigh5xxRate&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;ticket&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;&lt;a href="https://runbooks.safetywing.dev/runbooks/environment/environmenthigh5xxrate/"&gt;runbook&lt;/a&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="stock-alerts-kube-prometheus-stack"&gt;Stock alerts (kube-prometheus-stack)&lt;a class="anchor" href="#stock-alerts-kube-prometheus-stack"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;These alerts are shipped by the kube-prometheus-stack &lt;code&gt;defaultRules&lt;/code&gt; and are documented upstream, so each one links to its upstream runbook.&lt;/p&gt;</description></item><item><title>CephClusterNearFull</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephclusternearfull/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephclusternearfull/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Raw capacity usage of the Rook-Ceph cluster has crossed the configured threshold. As Ceph approaches full it first throttles, then refuses writes — so this alert is a capacity early-warning that needs action before it becomes an outage.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ceph_cluster_total_used_bytes &lt;span style="color:#f92672"&gt;/&lt;/span&gt; ceph_cluster_total_bytes &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;ratio&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When Ceph hits its &lt;code&gt;nearfull&lt;/code&gt;/&lt;code&gt;backfillfull&lt;/code&gt;/&lt;code&gt;full&lt;/code&gt; ratios it degrades to &lt;code&gt;HEALTH_WARN&lt;/code&gt; then &lt;code&gt;HEALTH_ERR&lt;/code&gt;, and at the full ratio it &lt;strong&gt;blocks writes&lt;/strong&gt; and can force volumes read-only. Because Ceph backs PVCs across &lt;strong&gt;all environments&lt;/strong&gt; on hetzner, a full cluster is a multi-environment storage outage.&lt;/p&gt;</description></item><item><title>CephHealthError</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephhealtherror/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephhealtherror/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Rook-Ceph cluster is reporting &lt;code&gt;HEALTH_ERR&lt;/code&gt; — Ceph has detected one or more error-level conditions and storage is at risk. This is the most severe Ceph health state.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ceph_health_status &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
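&lt;p&gt;As a reading aid, the numeric values of &lt;code&gt;ceph_health_status&lt;/code&gt; can be sketched as a small lookup. This is a minimal Python sketch, not part of the rule set: the values 1 and 2 follow the CephHealthWarning and CephHealthError expressions, the value 0 for &lt;code&gt;HEALTH_OK&lt;/code&gt; is assumed from the Ceph mgr Prometheus module, and the name &lt;code&gt;describe_health&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
# Hypothetical helper: maps a ceph_health_status sample to a Ceph health state.
# 1 and 2 match the alert expressions; 0 is an assumption about the exporter.
HEALTH_STATES = {0: "HEALTH_OK", 1: "HEALTH_WARN", 2: "HEALTH_ERR"}


def describe_health(value: float) -> str:
    """Return the health state name for a ceph_health_status sample value."""
    return HEALTH_STATES.get(int(value), "UNKNOWN")
```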
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph backs the PVCs for workloads across &lt;strong&gt;all environments&lt;/strong&gt; on the hetzner cluster. In &lt;code&gt;HEALTH_ERR&lt;/code&gt;, IO may stall, volumes can be forced read-only, and writes can be blocked. Treat as an active or imminent storage outage affecting every environment that depends on Ceph-backed storage.&lt;/p&gt;</description></item><item><title>CephHealthWarning</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephhealthwarning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephhealthwarning/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Rook-Ceph cluster has been in &lt;code&gt;HEALTH_WARN&lt;/code&gt; for a sustained period. Ceph is functional but degraded — something needs attention before it escalates to &lt;code&gt;HEALTH_ERR&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ceph_health_status &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 30m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Usually no immediate outage — IO continues. But &lt;code&gt;HEALTH_WARN&lt;/code&gt; indicates reduced redundancy or headroom (degraded PGs, an OSD nearing full, a flapping mon, etc.) that affects storage backing PVCs across &lt;strong&gt;all environments&lt;/strong&gt;. Left unaddressed it can progress to &lt;code&gt;HEALTH_ERR&lt;/code&gt; and read-only/blocked writes.&lt;/p&gt;</description></item><item><title>CephMonOutOfQuorum</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephmonoutofquorum/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephmonoutofquorum/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;At least one Ceph monitor (mon) has dropped out of quorum. Mons maintain the cluster map and consensus; losing one reduces fault tolerance, and losing a majority halts the cluster.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;count&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;ceph_mon_quorum_status &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
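&lt;p&gt;The quorum arithmetic behind the statement that losing a majority halts the cluster can be sketched as follows. This is a minimal Python sketch that assumes the usual strict-majority rule for Ceph mons; the function names are hypothetical, and the mon count of this cluster is not stated in this runbook.&lt;/p&gt;

```python
def quorum_size(mon_count: int) -> int:
    """Smallest number of mons that still forms a strict majority."""
    return mon_count // 2 + 1


def tolerable_mon_failures(mon_count: int) -> int:
    """Number of mons that can drop out before quorum is lost."""
    return mon_count - quorum_size(mon_count)
```

&lt;p&gt;Under that assumption, &lt;code&gt;tolerable_mon_failures(3)&lt;/code&gt; is 1 and &lt;code&gt;tolerable_mon_failures(5)&lt;/code&gt; is 2, which is why a single mon out of quorum already deserves attention in a small deployment.&lt;/p&gt;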
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Mons are the control plane of Ceph. With one mon out of quorum the cluster still serves IO but has no redundancy margin; if quorum is lost entirely, all Ceph IO stops and PVCs across &lt;strong&gt;all environments&lt;/strong&gt; on hetzner become unavailable.&lt;/p&gt;</description></item><item><title>CephOSDDown</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephosddown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephosddown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;One or more Ceph OSDs (object storage daemons — the per-disk processes that store data) are marked &lt;code&gt;down&lt;/code&gt;. Each OSD maps to a physical disk on a hetzner node; a down OSD reduces redundancy and capacity.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;count&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;ceph_osd_up &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph keeps serving IO from surviving replicas, so usually no outage. But redundancy is reduced and recovery/backfill load increases. Multiple OSDs down (or a full failure domain) can cause degraded/unavailable PGs and put PVCs across &lt;strong&gt;all environments&lt;/strong&gt; at risk.&lt;/p&gt;</description></item><item><title>ElasticsearchClusterRed</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusterred/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusterred/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Elasticsearch cluster health is RED: at least one &lt;strong&gt;primary&lt;/strong&gt; shard is unassigned, so part of the index data is unavailable and writes to affected indices fail. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_cluster_health_status{color&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;red&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Search and indexing for the affected indices are &lt;strong&gt;down&lt;/strong&gt;. This includes logging/observability data streams and application search indices (e.g. &lt;code&gt;sw_user&lt;/code&gt;, &lt;code&gt;sw_company&lt;/code&gt;, &lt;code&gt;sw_company_member&lt;/code&gt;). Any service that reads or writes those indices will see errors or empty results until the primaries are reassigned.&lt;/p&gt;</description></item><item><title>ElasticsearchClusterYellow</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusteryellow/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusteryellow/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Elasticsearch cluster health is YELLOW: all primary shards are assigned, but one or more &lt;strong&gt;replica&lt;/strong&gt; shards are unassigned. Data is fully available, but redundancy is reduced. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_cluster_health_status{color&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;yellow&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 30m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;No outage. Reads and writes continue to work. The risk is reduced fault tolerance: if a node holding a primary now fails, the cluster could go RED because there is no replica to promote. Performance for read-heavy indices may also drop while replicas are missing.&lt;/p&gt;</description></item><item><title>ElasticsearchDiskWatermark</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchdiskwatermark/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchdiskwatermark/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Free disk space on at least one Elasticsearch data node has dropped below 15%, approaching the &lt;strong&gt;flood-stage&lt;/strong&gt; watermark (default 95% used). At flood stage Elasticsearch makes indices on the affected node &lt;strong&gt;read-only&lt;/strong&gt; to protect the disk. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;min&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;elasticsearch_filesystem_data_available_bytes &lt;span style="color:#f92672"&gt;/&lt;/span&gt; elasticsearch_filesystem_data_size_bytes&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0.15&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
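&lt;p&gt;The expression takes the worst node, not the average, so one nearly full data node is enough to fire. A minimal Python sketch of that evaluation, assuming per-node samples are already available as a mapping; the function name and the input shape are hypothetical, and the &lt;code&gt;for: 15m&lt;/code&gt; hold period is not modelled.&lt;/p&gt;

```python
FREE_RATIO_THRESHOLD = 0.15  # taken from the alert expression


def disk_watermark_firing(nodes: dict[str, tuple[float, float]]) -> bool:
    """nodes maps a node name to (available_bytes, size_bytes).

    Mirrors min(available / size) compared against the threshold.
    Raises ValueError on an empty mapping, because min() has nothing to compare.
    """
    lowest_free = min(avail / size for avail, size in nodes.values())
    return FREE_RATIO_THRESHOLD > lowest_free
```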
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As watermarks are crossed, ES first stops allocating new shards to the node (low watermark, default 85% used; unassigned replicas can turn the cluster YELLOW), then tries to relocate shards off the node (high watermark, default 90% used), and at flood stage applies the &lt;code&gt;index.blocks.read_only_allow_delete&lt;/code&gt; block to every index with a shard on that node: &lt;strong&gt;writes to affected indices fail&lt;/strong&gt; while reads continue. Logging/observability ingestion and application indices stop accepting new data until disk is freed and the block is cleared.&lt;/p&gt;</description></item><item><title>ElasticsearchHeapHigh</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchheaphigh/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchheaphigh/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;JVM heap usage on an Elasticsearch node has been above 90% of max heap for a sustained period. The &lt;code&gt;name&lt;/code&gt; label identifies the node. Persistent heap pressure causes frequent/long GC pauses, slow responses, and can destabilize or OOM the node. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_jvm_memory_used_bytes{area&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;heap&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;/&lt;/span&gt; elasticsearch_jvm_memory_max_bytes{area&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;heap&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0.90&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Degraded performance across the cluster: increased query/index latency, GC stop-the-world pauses, and risk of the affected node dropping out (which can cascade to YELLOW/RED). No immediate data loss while the alert is just heap pressure.&lt;/p&gt;</description></item><item><title>EnvironmentHigh5xxRate</title><link>https://runbooks.safetywing.dev/runbooks/environment/environmenthigh5xxrate/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/environment/environmenthigh5xxrate/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The 5xx error ratio for a single environment exceeded its configured threshold. This is an environment-tier SLO alert emitted per environment by the &lt;code&gt;safetywing-environment&lt;/code&gt; chart, scoped to that environment&amp;rsquo;s application services.&lt;/p&gt;
&lt;p&gt;Fires when: the environment&amp;rsquo;s 5xx ratio crosses the threshold while traffic is above a minimum RPS floor (the &lt;code&gt;minRps&lt;/code&gt; guard avoids alerting on noise at low traffic).&lt;/p&gt;
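&lt;p&gt;The two conditions combine as a logical AND: the ratio is only considered once traffic clears the floor. A minimal Python sketch of that decision, assuming the request rates have already been computed; the function and parameter names are hypothetical, and the real &lt;code&gt;ratio&lt;/code&gt; and &lt;code&gt;minRps&lt;/code&gt; values come from the chart configuration, not from this sketch.&lt;/p&gt;

```python
def high_5xx_firing(error_rps: float, total_rps: float,
                    ratio: float, min_rps: float) -> bool:
    """True when the 5xx share exceeds ratio and traffic exceeds the min_rps floor."""
    if not (total_rps > 0 and total_rps > min_rps):
        # Too little traffic: a single failed request would dominate the ratio.
        return False
    return error_rps / total_rps > ratio
```

&lt;p&gt;With illustrative values, one failing request out of two per second gives a 50% error share, yet the function returns False when the floor is 5 requests per second, which is the noise the guard is meant to suppress.&lt;/p&gt;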
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# hetzner / Traefik form&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;traefik_service_requests_total{service&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-applications-.*&lt;/span&gt;&amp;#34;,code&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;5..&lt;/span&gt;&amp;#34;}[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;traefik_service_requests_total{service&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-applications-.*&lt;/span&gt;&amp;#34;}[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;ratio&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;and&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;traefik_service_requests_total{service&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-applications-.*&lt;/span&gt;&amp;#34;}[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;minRps&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;On GKE environments the ingress is &lt;strong&gt;nginx&lt;/strong&gt;, so the source metric differs:&lt;/p&gt;</description></item><item><title>KafkaConnectFailedTasks</title><link>https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectfailedtasks/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectfailedtasks/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;One or more Debezium CDC connector tasks have entered the &lt;code&gt;FAILED&lt;/code&gt; state, so change capture for the affected connector is degraded or fully stopped. Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_connect_worker_metrics_connector_failed_task_count{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;CDC from MOCO MySQL into Kafka is stalled for the failed connector. Downstream consumers stop receiving database changes: search indices fall behind, derived/mirror tables go stale, and any event-driven flow fed by these topics no longer reflects new writes. Lag grows until the task is recovered.&lt;/p&gt;</description></item><item><title>KafkaConnectNoConnectors</title><link>https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectnoconnectors/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectnoconnectors/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Kafka Connect cluster is running but reports zero connectors, meaning no Debezium CDC source connector is deployed or running: either CDC was never configured for the environment or all connectors were removed. Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_connect_worker_metrics_connector_count{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;No change capture is happening at all in this environment: no MySQL changes flow from MOCO MySQL into Kafka. Downstream consumers (search indices, mirror/derived tables, event-driven flows) receive nothing new. For a freshly provisioned env this may be expected during bring-up; for an established env it means CDC is silently broken.&lt;/p&gt;</description></item><item><title>KafkaConnectWorkersDown</title><link>https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectworkersdown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka-connect/kafkaconnectworkersdown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Fewer Kafka Connect workers are reporting metrics than the number of replicas the &lt;code&gt;kafka-cdc&lt;/code&gt; chart expects, indicating one or more Connect pods are down, crash-looping, or not being scraped. Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;count&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_connect_worker_metrics_connector_count{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;connect.replicas&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Reduced Connect capacity and resilience for the CDC pipeline. Tasks owned by the missing worker are rebalanced onto survivors (added load, possible throughput drop and lag); if the cluster is at one replica, CDC from MOCO MySQL into Kafka is fully stopped and downstream consumers (search indices, mirror tables, event flows) stop receiving DB changes.&lt;/p&gt;</description></item><item><title>KafkaConsumerGroupLagHigh</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkaconsumergrouplaghigh/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkaconsumergrouplaghigh/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A consumer group is falling behind the producers on a topic — the gap between the latest offset and the group&amp;rsquo;s committed offset (lag) has exceeded the threshold. Messages are being produced faster than they are consumed, so processing is delayed. The &lt;code&gt;consumergroup&lt;/code&gt; and &lt;code&gt;topic&lt;/code&gt; labels identify exactly which consumer and topic are affected.&lt;/p&gt;
&lt;p&gt;Fires when: per-(consumergroup, topic) lag exceeds &lt;code&gt;&amp;lt;threshold&amp;gt;&lt;/code&gt; for 15m. Severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;consumergroup, topic&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_consumergroup_lag{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;threshold&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Delayed processing for the affected consumer group → stale downstream data, late side effects, growing end-to-end latency.&lt;/li&gt;
&lt;li&gt;If lag keeps climbing, topic retention may delete messages before they are consumed, causing permanent message loss.&lt;/li&gt;
&lt;li&gt;Brokers retain more unconsumed data, increasing disk usage.&lt;/li&gt;
&lt;/ul&gt;
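&lt;p&gt;The retention risk above depends on how fast lag is growing, not only on its current value. A minimal sketch for checking the trend, assuming the same &lt;code&gt;kafka_consumergroup_lag&lt;/code&gt; gauge and namespace selector as the alert expression; the &lt;code&gt;30m&lt;/code&gt; window is an illustrative choice, not a value taken from the rule:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Approximate lag growth in messages per second, summed over partitions&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# (assumes one series per partition; 30m is an arbitrary window)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sum by (consumergroup, topic) (
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  deriv(kafka_consumergroup_lag{namespace=&amp;#34;safetywing-&amp;lt;env&amp;gt;-infra&amp;#34;}[30m])
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A sustained positive value means consumers are not keeping up; a value near zero with high absolute lag suggests a backlog that is stable but not draining; a negative value means the group is catching up.&lt;/p&gt;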
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get kafka,kafkanodepool -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Describe the lagging group: per-partition LAG, CURRENT-OFFSET, LOG-END-OFFSET, CONSUMER-ID&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --group &amp;lt;consumergroup&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Are there active members, or is the group empty / rebalancing?&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --group &amp;lt;consumergroup&amp;gt; --members --verbose
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Inspect the consuming application workload&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -A | grep &amp;lt;consumer-app&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n &amp;lt;consumer-ns&amp;gt; &amp;lt;consumer-pod&amp;gt; --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm trend and scope in Prometheus (&lt;code&gt;prom-ep.hetzner.safetywing.dev&lt;/code&gt;):&lt;/p&gt;</description></item><item><title>KafkaNoActiveController</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkanoactivecontroller/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkanoactivecontroller/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In KRaft mode exactly one controller node should be active (the metadata quorum leader). This alert means the cluster sees zero active controllers (no quorum leader) or more than one (split brain). Either state puts cluster metadata — topic, partition, ISR, and config state — at risk and blocks administrative operations.&lt;/p&gt;
&lt;p&gt;Fires when: the summed active controller count across the namespace is not exactly 1 for 5m. Severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>KafkaOfflinePartitions</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkaofflinepartitions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkaofflinepartitions/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;One or more partitions have no leader, so they cannot serve reads or writes. Any producer or consumer touching an offline partition is blocked, which usually means stalled traffic and a risk of data loss on the affected topics.&lt;/p&gt;
&lt;p&gt;Fires when: any broker reports a non-zero offline partition count for 5m. Severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_controller_kafkacontroller_offlinepartitionscount{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Produce and consume requests to offline partitions fail or hang.&lt;/li&gt;
&lt;li&gt;Consumer groups stall on the affected partitions; lag grows.&lt;/li&gt;
&lt;li&gt;Topics with offline partitions are partially unavailable.&lt;/li&gt;
&lt;li&gt;Often a symptom of multiple broker failures or unavailable replicas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Strimzi CRs and broker pods&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get kafka,kafkanodepool -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra -l strimzi.io/cluster
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Pods not in the Running phase (check the READY column too: a Running pod can still be not Ready)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra -o wide | grep -v Running
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Broker / controller logs (look for leader election, ISR, disk errors)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Cluster + topic state from inside a broker&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --under-min-isr-partitions
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --unavailable-partitions&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm scope in Prometheus (&lt;code&gt;prom-ep.hetzner.safetywing.dev&lt;/code&gt;):&lt;/p&gt;</description></item><item><title>KafkaUnderReplicatedPartitions</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkaunderreplicatedpartitions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkaunderreplicatedpartitions/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Partitions have fewer in-sync replicas than their configured replication factor. The cluster is still serving traffic, but durability is reduced — losing one more broker could take partitions offline or lose data. Usually a broker is down, restarting, or lagging behind on replication.&lt;/p&gt;
&lt;p&gt;Fires when: any broker reports a non-zero under-replicated partition count for 10m. Severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_server_replicamanager_underreplicatedpartitions{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Reduced fault tolerance: a single additional broker failure may cause offline partitions or data loss.&lt;/li&gt;
&lt;li&gt;Producers using &lt;code&gt;acks=all&lt;/code&gt; may slow down or block if the ISR drops below &lt;code&gt;min.insync.replicas&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Sustained under-replication often precedes a &lt;code&gt;KafkaOfflinePartitions&lt;/code&gt; page.&lt;/li&gt;
&lt;/ul&gt;
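&lt;p&gt;Because the alert expression takes &lt;code&gt;max()&lt;/code&gt; over every broker, it does not show which brokers are reporting under-replication. A sketch of a per-broker breakdown, reusing the alert&amp;rsquo;s metric and assuming the series carry a &lt;code&gt;pod&lt;/code&gt; label (an assumption, not confirmed by the rule):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Under-replicated partition count per broker pod (the pod label is assumed)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;max by (pod) (
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;  kafka_server_replicamanager_underreplicatedpartitions{namespace=&amp;#34;safetywing-&amp;lt;env&amp;gt;-infra&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Kafka reports this count on the broker that leads each affected partition, so brokers with a non-zero value are the leaders. The lagging or missing replica is usually a different broker, often one that reports zero or no series at all.&lt;/p&gt;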
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Strimzi CRs and broker pods&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get kafka,kafkanodepool -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra -l strimzi.io/cluster -o wide
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Any pod not Running or Completed (e.g. Pending, CrashLoopBackOff)?&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra | grep -vE &lt;span style="color:#e6db74"&gt;&amp;#34;Running|Completed&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Which partitions are under-replicated / below min ISR&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --under-replicated-partitions
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --under-min-isr-partitions
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Broker logs: replica fetcher, ISR shrink/expand, disk&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt; | grep -iE &lt;span style="color:#e6db74"&gt;&amp;#34;ISR|replica|fetch&amp;#34;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm scope in Prometheus (&lt;code&gt;prom-ep.hetzner.safetywing.dev&lt;/code&gt;):&lt;/p&gt;</description></item><item><title>MysqlConnectionsSaturated</title><link>https://runbooks.safetywing.dev/runbooks/mysql/mysqlconnectionssaturated/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/mysql/mysqlconnectionssaturated/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The number of connected threads on a MySQL instance is approaching &lt;code&gt;max_connections&lt;/code&gt;. New connections risk being refused with &lt;code&gt;ER_CON_COUNT_ERROR&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;pod&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; mysql_global_status_threads_connected{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; mysql_global_variables_max_connections{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;ratio&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Once &lt;code&gt;max_connections&lt;/code&gt; is hit, new clients get &amp;ldquo;Too many connections&amp;rdquo; and application requests fail.&lt;/li&gt;
&lt;li&gt;Often a symptom of leaked/unclosed connections, an oversized client pool, or slow queries holding connections open.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get mysqlcluster -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco status -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;cluster&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Run one-off queries against the primary to inspect live connections&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco mysql -n safetywing-&amp;lt;env&amp;gt;-infra -u moco-admin &amp;lt;cluster&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e &lt;span style="color:#e6db74"&gt;&amp;#34;SHOW PROCESSLIST;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco mysql -n safetywing-&amp;lt;env&amp;gt;-infra -u moco-admin &amp;lt;cluster&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e &lt;span style="color:#e6db74"&gt;&amp;#34;SHOW STATUS LIKE &amp;#39;Threads_connected&amp;#39;; SHOW VARIABLES LIKE &amp;#39;max_connections&amp;#39;;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Group connections by host/user to find the offender&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco mysql -n safetywing-&amp;lt;env&amp;gt;-infra -u moco-admin &amp;lt;cluster&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e &lt;span style="color:#e6db74"&gt;&amp;#34;SELECT user, host, count(*) FROM information_schema.processlist GROUP BY user, host ORDER BY 3 DESC;&amp;#34;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Current usage ratio per pod&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;pod&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; mysql_global_status_threads_connected{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; mysql_global_variables_max_connections{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="mitigation"&gt;Mitigation&lt;a class="anchor" href="#mitigation"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Identify the offending client(s) from &lt;code&gt;SHOW PROCESSLIST&lt;/code&gt; / the grouped query above — usually one service with a misconfigured pool or leaked connections.&lt;/li&gt;
&lt;li&gt;Fix at the source: scale down the offending workload, tune its connection pool max size, or restart it to drop leaked connections.&lt;/li&gt;
&lt;li&gt;Kill stuck/sleeping connections if needed:
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-sql" data-lang="sql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;KILL &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;process_id&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;If demand is legitimate, raise &lt;code&gt;max_connections&lt;/code&gt; in the ConfigMap referenced by the &lt;code&gt;MySQLCluster&lt;/code&gt; spec (MOCO reconciles it into the instances&amp;rsquo; &lt;code&gt;my.cnf&lt;/code&gt;):
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;mysqlConfigMapName&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;&amp;lt;name&amp;gt; &lt;/span&gt; &lt;span style="color:#75715e"&gt;# or set under spec.podTemplate config&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# in the referenced ConfigMap:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# max_connections = &amp;lt;n&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;Ensure the instance has memory headroom — each connection consumes per-thread buffers.&lt;/li&gt;
&lt;/ol&gt;
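&lt;p&gt;The kill step above can be narrowed to idle sessions with a query along these lines. This is a sketch rather than part of the alert tooling: the 600-second idle cutoff is an assumed value, and the generated statements should be reviewed before any of them is executed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-sql" data-lang="sql"&gt;-- Sessions idle in Sleep for more than 600 s (assumed cutoff), each with a ready-made KILL statement
SELECT CONCAT(&amp;#39;KILL &amp;#39;, id, &amp;#39;;&amp;#39;) AS kill_stmt, user, host, time AS idle_seconds
FROM information_schema.processlist
WHERE command = &amp;#39;Sleep&amp;#39; AND time &amp;gt; 600
ORDER BY time DESC;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Sorting by idle time puts the longest-idle sessions first, which are usually the safest candidates to kill.&lt;/p&gt;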
&lt;h2 id="references"&gt;References&lt;a class="anchor" href="#references"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cybozu-go.github.io/moco/"&gt;MOCO docs (cybozu-go/moco)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/too-many-connections.html"&gt;MySQL: Too many connections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_max_connections"&gt;MySQL: max_connections&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>MysqlDiskFillingUp</title><link>https://runbooks.safetywing.dev/runbooks/mysql/mysqldiskfillingup/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/mysql/mysqldiskfillingup/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A &lt;code&gt;mysql-data-*&lt;/code&gt; PersistentVolumeClaim is running low on free space. If it fills completely, mysqld will fail writes and may refuse to start.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;min&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;persistentvolumeclaim&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; kubelet_volume_stats_available_bytes{persistentvolumeclaim&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;mysql-data-.*&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; kubelet_volume_stats_capacity_bytes{persistentvolumeclaim&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;mysql-data-.*&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt; &lt;span style="color:#f92672"&gt;-&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;ratio&lt;span style="color:#f92672"&gt;&amp;gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;A full data volume causes write errors and can crash the instance (&lt;a href="https://runbooks.safetywing.dev/runbooks/mysql/mysqlinstancedown/"&gt;MysqlInstanceDown&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Common culprits: accumulated binary logs, an oversized dataset, or relay logs/temp files on a lagging replica.&lt;/li&gt;
&lt;/ul&gt;
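&lt;p&gt;To estimate how soon a volume could reach the failure mode described above, a linear projection of free space can help. The query below is a sketch using PromQL&amp;rsquo;s &lt;code&gt;predict_linear&lt;/code&gt;; the 6-hour lookback and 24-hour horizon are assumed values and are not taken from the alert rule.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;# PVCs projected to run out of free space within 24 h, based on the last 6 h trend (both windows are assumptions)
min by (persistentvolumeclaim) (
  predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim=~&amp;#34;mysql-data-.*&amp;#34;}[6h], 24 * 3600)
) &amp;lt; 0&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;An empty result means no volume is on track to fill within the horizon at its current growth rate. A sudden burst of writes can still fill the disk faster than the projection suggests.&lt;/p&gt;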
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get mysqlcluster -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pvc -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Inspect on-disk usage from inside the mysqld container&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt; -c mysqld -- df -h /var/lib/mysql
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt; -c mysqld -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; sh -c &lt;span style="color:#e6db74"&gt;&amp;#39;du -sh /var/lib/mysql/* | sort -h | tail -20&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Binary log inventory&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco mysql -n safetywing-&amp;lt;env&amp;gt;-infra -u moco-admin &amp;lt;cluster&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e &lt;span style="color:#e6db74"&gt;&amp;#34;SHOW BINARY LOGS;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Largest tables&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco mysql -n safetywing-&amp;lt;env&amp;gt;-infra -u moco-admin &amp;lt;cluster&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e &lt;span style="color:#e6db74"&gt;&amp;#34;SELECT table_schema, table_name, ROUND((data_length+index_length)/1024/1024) AS mb FROM information_schema.tables ORDER BY mb DESC LIMIT 20;&amp;#34;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Fraction free per PVC&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;min&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;persistentvolumeclaim&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; kubelet_volume_stats_available_bytes{persistentvolumeclaim&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;mysql-data-.*&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; kubelet_volume_stats_capacity_bytes{persistentvolumeclaim&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;mysql-data-.*&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="mitigation"&gt;Mitigation&lt;a class="anchor" href="#mitigation"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Expand the PVC&lt;/strong&gt; (preferred — the StorageClass supports volume expansion). Increase the volume request in the &lt;code&gt;MySQLCluster&lt;/code&gt; &lt;code&gt;volumeClaimTemplates&lt;/code&gt;; MOCO/Kubernetes resizes the PVC online:
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;volumeClaimTemplates&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; - &lt;span style="color:#f92672"&gt;metadata&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;name&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;mysql-data&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;spec&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;resources&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;requests&lt;/span&gt;:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;storage&lt;/span&gt;: &lt;span style="color:#ae81ff"&gt;&amp;lt;larger-size&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;Apply via GitOps, then confirm with &lt;code&gt;kubectl get pvc -n safetywing-&amp;lt;env&amp;gt;-infra&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prune binary logs&lt;/strong&gt; if they dominate usage (only purge logs already applied by all replicas):
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-sql" data-lang="sql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;PURGE BINARY LOGS &lt;span style="color:#66d9ef"&gt;BEFORE&lt;/span&gt; NOW() &lt;span style="color:#f92672"&gt;-&lt;/span&gt; INTERVAL &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;DAY&lt;/span&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;For a durable fix, tune &lt;code&gt;binlog_expire_logs_seconds&lt;/code&gt; in the cluster MySQL config.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replica behind&lt;/strong&gt;: a lagging replica accumulates relay logs; clearing the lag (&lt;a href="https://runbooks.safetywing.dev/runbooks/mysql/mysqlreplicationlaghigh/"&gt;MysqlReplicationLagHigh&lt;/a&gt;) lets them be purged.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reclaim table space&lt;/strong&gt;: drop unused data or run &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt; on bloated tables (note: requires temporary extra space, so resize first if very full).&lt;/li&gt;
&lt;/ol&gt;
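&lt;p&gt;Before pruning binary logs as described in step 2, it is worth confirming how far each replica has read. The statements below are a sketch that assumes the MySQL 8.0.22+ field names; &lt;code&gt;&amp;lt;binlog-file&amp;gt;&lt;/code&gt; is a placeholder for the oldest file that any replica still needs.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-sql" data-lang="sql"&gt;-- On each replica: note Source_Log_File (being read) and Relay_Source_Log_File (being applied)
SHOW REPLICA STATUS\G

-- On the primary: remove every binlog file older than the named one (the named file itself is kept)
PURGE BINARY LOGS TO &amp;#39;&amp;lt;binlog-file&amp;gt;&amp;#39;;

-- Current retention setting, as a starting point for the durable fix
SHOW VARIABLES LIKE &amp;#39;binlog_expire_logs_seconds&amp;#39;;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Purging by file name rather than by timestamp makes it explicit which files survive, which is easier to check against the replica positions.&lt;/p&gt;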
&lt;h2 id="references"&gt;References&lt;a class="anchor" href="#references"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cybozu-go.github.io/moco/"&gt;MOCO docs (cybozu-go/moco)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cybozu-go.github.io/moco/usage.html"&gt;MOCO: volumeClaimTemplates / volume expansion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/purge-binary-logs.html"&gt;MySQL: PURGE BINARY LOGS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>MysqlInstanceDown</title><link>https://runbooks.safetywing.dev/runbooks/mysql/mysqlinstancedown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/mysql/mysqlinstancedown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A MySQL instance&amp;rsquo;s &lt;code&gt;mysqld_exporter&lt;/code&gt; sidecar reports the server unreachable, so the mysqld process is down or not accepting connections.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;pod&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;mysql_up{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The affected instance serves no queries.&lt;/li&gt;
&lt;li&gt;If the primary is down, MOCO must fail over before writes can resume; expect a short write outage.&lt;/li&gt;
&lt;li&gt;If a replica is down, read capacity and replication redundancy are reduced.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Cluster + member roles (which pod is primary vs replica)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get mysqlcluster -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco status -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;cluster&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Pod state and recent events (OOMKill, evictions, probe failures)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra -l app.kubernetes.io/name&lt;span style="color:#f92672"&gt;=&lt;/span&gt;mysql -o wide
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl describe pod -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Container logs — mysqld, the MOCO agent, and the exporter&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt; -c mysqld --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt; -c agent --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt; -c mysqld-exporter --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;100&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Confirm which pod(s) are down&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;pod&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;mysql_up{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="mitigation"&gt;Mitigation&lt;a class="anchor" href="#mitigation"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Check pod events for the root cause: OOMKilled (raise memory limits in the &lt;code&gt;MySQLCluster&lt;/code&gt; &lt;code&gt;.spec.podTemplate&lt;/code&gt;), node pressure/eviction, or failed PVC mount.&lt;/li&gt;
&lt;li&gt;If the disk is full, mysqld will refuse to start; see &lt;a href="https://runbooks.safetywing.dev/runbooks/mysql/mysqldiskfillingup/"&gt;MysqlDiskFillingUp&lt;/a&gt; and expand the PVC first.&lt;/li&gt;
&lt;li&gt;If mysqld crashed but the pod is otherwise healthy, restart it by deleting the pod:
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl delete pod -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;MOCO recreates the pod; verify it rejoins via &lt;code&gt;kubectl moco status&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the &lt;strong&gt;primary&lt;/strong&gt; is down and not recovering, let MOCO fail over to a healthy replica; confirm a new primary was elected in &lt;code&gt;kubectl moco status&lt;/code&gt;. Investigate the old primary before reintroducing it.&lt;/li&gt;
&lt;li&gt;If MOCO cannot reconcile, inspect the operator:
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n moco-system deploy/moco-controller-manager --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;/ol&gt;
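&lt;p&gt;The root-cause check in step 1 can be narrowed with the commands below, which pull out just the termination reason and the events for one pod. This is a sketch using standard &lt;code&gt;kubectl&lt;/code&gt; output options; the placeholders match those used in Diagnosis.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;# Last termination reason per container (for example OOMKilled); empty if the container never terminated
kubectl get pod -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;pod&amp;gt; \
  -o jsonpath=&amp;#39;{range .status.containerStatuses[*]}{.name}{&amp;#34;\t&amp;#34;}{.lastState.terminated.reason}{&amp;#34;\n&amp;#34;}{end}&amp;#39;

# Events that reference the pod, oldest first
kubectl get events -n safetywing-&amp;lt;env&amp;gt;-infra \
  --field-selector involvedObject.name=&amp;lt;pod&amp;gt; --sort-by=.lastTimestamp&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;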
&lt;h2 id="references"&gt;References&lt;a class="anchor" href="#references"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cybozu-go.github.io/moco/"&gt;MOCO docs (cybozu-go/moco)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/server-shutdown.html"&gt;MySQL: Server shutdown / startup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>MysqlReplicationLagHigh</title><link>https://runbooks.safetywing.dev/runbooks/mysql/mysqlreplicationlaghigh/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/mysql/mysqlreplicationlaghigh/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
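&lt;p&gt;A MySQL replica is applying the primary&amp;rsquo;s binlog stream slower than it is produced, so its replication lag (&lt;code&gt;Seconds_Behind_Source&lt;/code&gt; in &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt;, exported as &lt;code&gt;mysql_slave_status_seconds_behind_master&lt;/code&gt;) has exceeded the threshold.&lt;/p&gt;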
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;pod&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; mysql_slave_status_seconds_behind_master{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;seconds&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Reads served by the lagging replica return stale data.&lt;/li&gt;
&lt;li&gt;A failover to a lagging replica could lose recent writes (MOCO uses semi-sync, which bounds but does not eliminate this risk under degraded conditions).&lt;/li&gt;
&lt;/ul&gt;
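&lt;p&gt;How urgent the stale-read and failover risks above are depends on whether the lag is still growing. The query below is a sketch using PromQL&amp;rsquo;s &lt;code&gt;deriv&lt;/code&gt; on the same metric as the alert; the 15-minute window is an assumed value.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;# Per-second change in lag over the last 15 m (assumed window); positive means falling further behind
max by (pod) (
  deriv(mysql_slave_status_seconds_behind_master{namespace=&amp;#34;safetywing-&amp;lt;env&amp;gt;-infra&amp;#34;}[15m])
)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A negative value suggests the replica is catching up on its own. A value that stays positive points to a replica that will not recover without intervention.&lt;/p&gt;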
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get mysqlcluster -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco status -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;cluster&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Replication status on the lagging replica&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco mysql -n safetywing-&amp;lt;env&amp;gt;-infra -u moco-admin --index &amp;lt;n&amp;gt; &amp;lt;cluster&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e &lt;span style="color:#e6db74"&gt;&amp;#34;SHOW REPLICA STATUS\G&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Container logs of the replica (mysqld + agent)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;replica-pod&amp;gt; -c mysqld --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;replica-pod&amp;gt; -c agent --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Long-running transactions on the primary that bloat the binlog&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl moco mysql -n safetywing-&amp;lt;env&amp;gt;-infra -u moco-admin &amp;lt;cluster&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; -e &lt;span style="color:#e6db74"&gt;&amp;#34;SELECT * FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10;&amp;#34;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Lag per pod&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;pod&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;mysql_slave_status_seconds_behind_master{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt;, check &lt;code&gt;Replica_IO_Running&lt;/code&gt; / &lt;code&gt;Replica_SQL_Running&lt;/code&gt; (both should be &lt;code&gt;Yes&lt;/code&gt;), &lt;code&gt;Last_Error&lt;/code&gt;, and &lt;code&gt;Seconds_Behind_Source&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>NodeFilesystemAlmostFull</title><link>https://runbooks.safetywing.dev/runbooks/node/nodefilesystemalmostfull/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/node/nodefilesystemalmostfull/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A node filesystem is running low on free space. This is a supplemental platform-tier rule on top of the kube-prometheus-stack node-exporter mixin.&lt;/p&gt;
&lt;p&gt;Fires when: available space on a non-ephemeral filesystem drops below the configured ratio.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;node_filesystem_avail_bytes{fstype&lt;span style="color:#f92672"&gt;!~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;tmpfs|overlay|squashfs&lt;/span&gt;&amp;#34;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; node_filesystem_size_bytes{fstype&lt;span style="color:#f92672"&gt;!~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;tmpfs|overlay|squashfs&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;ratio&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt; (cluster-wide, no &lt;code&gt;environment&lt;/code&gt; label). The offending filesystem is identified by the &lt;code&gt;instance&lt;/code&gt; and &lt;code&gt;mountpoint&lt;/code&gt; labels.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A full filesystem on a node can wedge the kubelet, fail image pulls, block container log writes, evict pods (ephemeral-storage pressure), and on Talos can disrupt the system partition. If the node hosts stateful workloads (Ceph OSDs, MySQL/MOCO, RabbitMQ), data writes can stall or fail.&lt;/p&gt;</description></item><item><title>RabbitmqDeadLetterMessages</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdeadlettermessages/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdeadlettermessages/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A dead-letter queue (DLQ) holds one or more ready messages. DLQs are the topology chart&amp;rsquo;s &lt;code&gt;{namespace}.deadletter&lt;/code&gt; queues — under normal operation they are empty.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_messages_ready{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;, queue&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;.+[.]deadletter&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;. The &lt;code&gt;queue&lt;/code&gt; label identifies the affected DLQ (always ends in &lt;code&gt;.deadletter&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Each &lt;code&gt;rabbitmq-topology&lt;/code&gt; namespace wires a three-queue retry flow: the main queue &lt;code&gt;{ns}&lt;/code&gt; dead-letters failed messages to &lt;code&gt;{ns}.retry&lt;/code&gt; (which holds them for &lt;code&gt;retryDelay&lt;/code&gt;, default 10 min, then re-publishes to the main queue). A message only lands in the &lt;strong&gt;dead-letter queue&lt;/strong&gt; &lt;code&gt;{ns}.deadletter&lt;/code&gt; when it is dead-lettered with the &lt;code&gt;deadletter&lt;/code&gt; routing key — i.e. it has &lt;strong&gt;exhausted its retries&lt;/strong&gt; or was explicitly rejected as terminally unprocessable. So a non-empty DLQ means &amp;ldquo;messages a consumer gave up on&amp;rdquo;, not transient backpressure.&lt;/p&gt;</description></item><item><title>RabbitmqDiskAlarm</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdiskalarm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdiskalarm/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A RabbitMQ node has dropped below its free-disk-space watermark and raised a disk alarm. RabbitMQ blocks publishers to avoid filling the disk and corrupting the message store.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_alarms_free_disk_space_watermark{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Publishing is blocked cluster-wide.&lt;/strong&gt; As with the memory alarm, once any node trips the free-disk watermark RabbitMQ blocks all publishing connections until free space recovers. Consumers continue, but producers hang and dependent backend services back up. If the disk fills completely the node can crash and lose durability guarantees, so this must be cleared promptly.&lt;/p&gt;</description></item><item><title>RabbitmqMemoryAlarm</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqmemoryalarm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqmemoryalarm/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A RabbitMQ node has crossed its memory high-watermark and raised a memory alarm. RabbitMQ responds by blocking all publishers across the cluster to protect the broker from running out of memory.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_alarms_memory_used_watermark{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Publishing is blocked cluster-wide.&lt;/strong&gt; Once any node hits the memory watermark, RabbitMQ throttles/blocks all connections that are publishing, so producers across every queue hang. Consumers keep draining, but new messages cannot be accepted. This typically surfaces as backend services timing out on publish and growing request latency until memory is reclaimed.&lt;/p&gt;</description></item><item><title>RabbitmqNodeDown</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqnodedown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqnodedown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Fewer RabbitMQ nodes are reporting metrics than the configured number of replicas, meaning one or more cluster members are down or unreachable.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;count&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_build_info{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;replicas&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A missing node reduces capacity and redundancy. With quorum queues, losing a node erodes the quorum margin; losing a majority makes those queues unavailable for reads and writes. Classic queues without a mirror on a surviving node become unavailable until the down node returns. Sustained node loss risks a full cluster outage.&lt;/p&gt;</description></item><item><title>RabbitmqQueueBacklog</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuebacklog/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuebacklog/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A queue has accumulated a large number of ready (undelivered) messages, meaning consumers are not keeping up with producers.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_messages_ready{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;threshold&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;. The &lt;code&gt;queue&lt;/code&gt; label identifies the affected queue.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Messages are being produced faster than they are consumed. Downstream processing is delayed, so whatever the queue feeds (notifications, payments, sync jobs, etc.) lags behind. A persistently growing backlog also consumes memory and disk and can eventually trip the &lt;a href="RabbitmqMemoryAlarm"&gt;memory&lt;/a&gt; or &lt;a href="RabbitmqDiskAlarm"&gt;disk&lt;/a&gt; alarms and block publishing cluster-wide.&lt;/p&gt;</description></item><item><title>RabbitmqQueueNoConsumers</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuenoconsumers/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuenoconsumers/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A queue has ready messages but zero consumers attached, so nothing is draining it. The messages stay queued until a consumer connects.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_messages_ready{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;and&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_consumers{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;. The &lt;code&gt;queue&lt;/code&gt; label identifies the affected queue.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Work enqueued on this queue is not being processed at all. Unlike a slow backlog, there is no progress whatsoever, so the dependent feature is effectively down. The backlog will keep growing and can eventually trip the &lt;a href="RabbitmqMemoryAlarm"&gt;memory&lt;/a&gt; or &lt;a href="RabbitmqDiskAlarm"&gt;disk&lt;/a&gt; alarms and block publishing cluster-wide.&lt;/p&gt;</description></item><item><title>TraefikDown</title><link>https://runbooks.safetywing.dev/runbooks/traefik/traefikdown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/traefik/traefikdown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Prometheus has no healthy Traefik scrape target. Either the Traefik edge ingress itself is down, or only its metrics endpoint or the scraping of it is broken.&lt;/p&gt;
&lt;p&gt;Fires when: there is no &lt;code&gt;up == 1&lt;/code&gt; series for any Traefik scrape job.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;absent&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;up{job&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;.*traefik.*&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt; (cluster-wide, no &lt;code&gt;environment&lt;/code&gt; label).&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Traefik is the edge ingress (a hostNetwork DaemonSet on the Hetzner cluster). If Traefik itself is down, all external HTTP(S) traffic into the cluster fails — every public domain behind it. If only metrics are broken, traffic may still flow but we are blind to edge health and 5xx alerting is degraded.&lt;/p&gt;</description></item><item><title>TraefikHigh5xxRate</title><link>https://runbooks.safetywing.dev/runbooks/traefik/traefikhigh5xxrate/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/traefik/traefikhigh5xxrate/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The cluster-wide ratio of 5xx responses served by Traefik exceeds 5%. This is an edge-level signal aggregating all services behind Traefik.&lt;/p&gt;
&lt;p&gt;Fires when: 5xx requests are more than 5% of all Traefik service requests.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;traefik_service_requests_total{code&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;5..&lt;/span&gt;&amp;#34;}[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;/&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;sum&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;&lt;span style="color:#66d9ef"&gt;rate&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;traefik_service_requests_total[&lt;span style="color:#e6db74"&gt;5m&lt;/span&gt;]&lt;span style="color:#f92672"&gt;))&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0.05&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt; (cluster-wide, no &lt;code&gt;environment&lt;/code&gt; label).&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A meaningful fraction of requests entering through the edge are failing with server errors. Because this is cluster-wide, it usually points at either a broadly-impacting backend (a shared dependency) or one high-traffic service skewing the aggregate. Users across one or more environments see errors.&lt;/p&gt;</description></item></channel></rss>