<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kafka on SafetyWing Runbooks</title><link>https://runbooks.safetywing.dev/runbooks/kafka/</link><description>Recent content in Kafka on SafetyWing Runbooks</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://runbooks.safetywing.dev/runbooks/kafka/index.xml" rel="self" type="application/rss+xml"/><item><title>KafkaConsumerGroupLagHigh</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkaconsumergrouplaghigh/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkaconsumergrouplaghigh/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A consumer group is falling behind the producers on a topic: its lag, the gap between the latest (log-end) offset and the group&amp;rsquo;s committed offset, has exceeded the threshold. Messages are being produced faster than they are consumed, so processing is delayed. The &lt;code&gt;consumergroup&lt;/code&gt; and &lt;code&gt;topic&lt;/code&gt; labels identify exactly which consumer group and topic are affected.&lt;/p&gt;
&lt;p&gt;Fires when: per-(consumergroup, topic) lag exceeds &lt;code&gt;&amp;lt;threshold&amp;gt;&lt;/code&gt; for 15m. Severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;consumergroup, topic&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_consumergroup_lag{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;threshold&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Delayed processing for the affected consumer group → stale downstream data, late side effects, growing end-to-end latency.&lt;/li&gt;
&lt;li&gt;If lag keeps climbing, messages may age out of retention before they are consumed, causing permanent message loss.&lt;/li&gt;
&lt;li&gt;Brokers retain more unconsumed data, increasing disk usage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get kafka,kafkanodepool -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Describe the lagging group: per-partition LAG, CURRENT-OFFSET, LOG-END-OFFSET, CONSUMER-ID&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --group &amp;lt;consumergroup&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Are there active members, or is the group empty / rebalancing?&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --group &amp;lt;consumergroup&amp;gt; --members --verbose
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Inspect the consuming application workload&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -A | grep &amp;lt;consumer-app&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n &amp;lt;consumer-ns&amp;gt; &amp;lt;consumer-pod&amp;gt; --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm trend and scope in Prometheus (&lt;code&gt;prom-ep.hetzner.safetywing.dev&lt;/code&gt;):&lt;/p&gt;</description></item><item><title>KafkaNoActiveController</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkanoactivecontroller/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkanoactivecontroller/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In KRaft mode exactly one controller node should be active (the metadata quorum leader). This alert means the cluster sees zero active controllers (no quorum leader) or more than one (split brain). Either state puts cluster metadata — topic, partition, ISR, and config state — at risk and blocks administrative operations.&lt;/p&gt;
&lt;p&gt;Fires when: the summed active controller count across the namespace is not exactly 1 for 5m. Severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;</description></item><item><title>KafkaOfflinePartitions</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkaofflinepartitions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkaofflinepartitions/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;One or more partitions have no leader, so they cannot serve reads or writes. Any producer or consumer touching an offline partition is blocked, so traffic on the affected topics stalls and there is a risk of data loss.&lt;/p&gt;
&lt;p&gt;Fires when: any broker reports a non-zero offline partition count for 5m. Severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_controller_kafkacontroller_offlinepartitionscount{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Produce and consume requests to offline partitions fail or hang.&lt;/li&gt;
&lt;li&gt;Consumer groups stall on the affected partitions; lag grows.&lt;/li&gt;
&lt;li&gt;Topics with offline partitions are partially unavailable.&lt;/li&gt;
&lt;li&gt;Often a symptom of multiple broker failures or unavailable replicas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Strimzi CRs and broker pods&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get kafka,kafkanodepool -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra -l strimzi.io/cluster
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Pods not in Running status (a Running pod can still be unready: check the READY column)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra -o wide | grep -v Running
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Broker / controller logs (look for leader election, ISR, disk errors)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# From inside a broker: partitions below min ISR, then partitions with no available leader&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --under-min-isr-partitions
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --unavailable-partitions&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm scope in Prometheus (&lt;code&gt;prom-ep.hetzner.safetywing.dev&lt;/code&gt;):&lt;/p&gt;</description></item><item><title>KafkaUnderReplicatedPartitions</title><link>https://runbooks.safetywing.dev/runbooks/kafka/kafkaunderreplicatedpartitions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/kafka/kafkaunderreplicatedpartitions/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Partitions have fewer in-sync replicas than their configured replication factor. The cluster is still serving traffic, but durability is reduced — losing one more broker could take partitions offline or lose data. Usually a broker is down, restarting, or lagging behind on replication.&lt;/p&gt;
&lt;p&gt;Fires when: any broker reports a non-zero under-replicated partition count for 10m. Severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;kafka_server_replicamanager_underreplicatedpartitions{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Reduced fault tolerance: a single additional broker failure may cause offline partitions or data loss.&lt;/li&gt;
&lt;li&gt;Producers using &lt;code&gt;acks=all&lt;/code&gt; have their writes rejected and retried once the ISR drops below &lt;code&gt;min.insync.replicas&lt;/code&gt;, so they may slow down or block.&lt;/li&gt;
&lt;li&gt;Sustained under-replication often precedes a &lt;code&gt;KafkaOfflinePartitions&lt;/code&gt; page.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="diagnosis"&gt;Diagnosis&lt;a class="anchor" href="#diagnosis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl config use-context hetzner
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Strimzi CRs and broker pods&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get kafka,kafkanodepool -n safetywing-&amp;lt;env&amp;gt;-infra
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra -l strimzi.io/cluster -o wide
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Any broker not Ready / restarting?&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl get pods -n safetywing-&amp;lt;env&amp;gt;-infra | grep -vE &lt;span style="color:#e6db74"&gt;&amp;#34;Running|Completed&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Which partitions are under-replicated / below min ISR&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --under-replicated-partitions
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl exec -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; -- &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bin/kafka-topics.sh --bootstrap-server localhost:9092 &lt;span style="color:#ae81ff"&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; --describe --under-min-isr-partitions
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;# Broker logs: replica fetcher, ISR shrink/expand, disk&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;kubectl logs -n safetywing-&amp;lt;env&amp;gt;-infra &amp;lt;broker-pod&amp;gt; --tail&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;200&lt;/span&gt; | grep -iE &lt;span style="color:#e6db74"&gt;&amp;#34;ISR|replica|fetch&amp;#34;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Confirm scope in Prometheus (&lt;code&gt;prom-ep.hetzner.safetywing.dev&lt;/code&gt;):&lt;/p&gt;</description></item></channel></rss>