<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>RabbitMQ on SafetyWing Runbooks</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/</link><description>Recent content in RabbitMQ on SafetyWing Runbooks</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://runbooks.safetywing.dev/runbooks/rabbitmq/index.xml" rel="self" type="application/rss+xml"/><item><title>RabbitmqDeadLetterMessages</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdeadlettermessages/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdeadlettermessages/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A dead-letter queue (DLQ) holds one or more ready messages. DLQs are the topology chart&amp;rsquo;s &lt;code&gt;{namespace}.deadletter&lt;/code&gt; queues — under normal operation they are empty.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_messages_ready{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;, queue&lt;span style="color:#f92672"&gt;=~&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;.+[.]deadletter&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;. The &lt;code&gt;queue&lt;/code&gt; label identifies the affected DLQ (always ends in &lt;code&gt;.deadletter&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Each &lt;code&gt;rabbitmq-topology&lt;/code&gt; namespace wires a three-queue retry flow. The main queue &lt;code&gt;{ns}&lt;/code&gt; dead-letters failed messages to &lt;code&gt;{ns}.retry&lt;/code&gt;, which holds them for &lt;code&gt;retryDelay&lt;/code&gt; (default 10 min) and then re-publishes them to the main queue. A message lands in the &lt;strong&gt;dead-letter queue&lt;/strong&gt; &lt;code&gt;{ns}.deadletter&lt;/code&gt; only when it is dead-lettered with the &lt;code&gt;deadletter&lt;/code&gt; routing key, that is, when it has &lt;strong&gt;exhausted its retries&lt;/strong&gt; or was explicitly rejected as terminally unprocessable. A non-empty DLQ therefore means &amp;ldquo;messages a consumer gave up on&amp;rdquo;, not transient backpressure.&lt;/p&gt;</description></item><item><title>RabbitmqDiskAlarm</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdiskalarm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqdiskalarm/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A RabbitMQ node has dropped below its free-disk-space watermark and raised a disk alarm. RabbitMQ blocks publishers to avoid filling the disk and corrupting the message store.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_alarms_free_disk_space_watermark{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Publishing is blocked cluster-wide.&lt;/strong&gt; As with the memory alarm, once any node trips the free-disk watermark, RabbitMQ blocks all publishing connections until free space recovers. Consumers continue, but producers hang and dependent backend services back up. If the disk fills completely, the node can crash and lose durability guarantees, so the alarm must be cleared promptly.&lt;/p&gt;</description></item><item><title>RabbitmqMemoryAlarm</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqmemoryalarm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqmemoryalarm/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A RabbitMQ node has crossed its memory high-watermark and raised a memory alarm. RabbitMQ responds by blocking all publishers across the cluster to protect the broker from running out of memory.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_alarms_memory_used_watermark{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Publishing is blocked cluster-wide.&lt;/strong&gt; Once any node hits the memory watermark, RabbitMQ throttles/blocks all connections that are publishing, so producers across every queue hang. Consumers keep draining, but new messages cannot be accepted. This typically surfaces as backend services timing out on publish and growing request latency until memory is reclaimed.&lt;/p&gt;</description></item><item><title>RabbitmqNodeDown</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqnodedown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqnodedown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Fewer RabbitMQ nodes are reporting metrics than the configured number of replicas, meaning one or more cluster members are down or unreachable.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;count&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_build_info{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;replicas&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A missing node reduces capacity and redundancy. With quorum queues, losing a node erodes the quorum margin; losing a majority of a queue&amp;rsquo;s members makes that queue unavailable for reads and writes. Classic queues hosted only on the down node (single-node, or mirrored with no surviving synchronized mirror) become unavailable until it returns. Sustained node loss risks a full cluster outage.&lt;/p&gt;</description></item><item><title>RabbitmqQueueBacklog</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuebacklog/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuebacklog/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A queue has accumulated a large number of ready (undelivered) messages, meaning consumers are not keeping up with producers.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_messages_ready{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;threshold&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;. The &lt;code&gt;queue&lt;/code&gt; label identifies the affected queue.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Messages are being produced faster than they are consumed. Downstream processing is delayed, so whatever the queue feeds (notifications, payments, sync jobs, etc.) lags behind. A persistently growing backlog also consumes memory and disk and can eventually trip the &lt;a href="RabbitmqMemoryAlarm"&gt;memory&lt;/a&gt; or &lt;a href="RabbitmqDiskAlarm"&gt;disk&lt;/a&gt; alarms and block publishing cluster-wide.&lt;/p&gt;</description></item><item><title>RabbitmqQueueNoConsumers</title><link>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuenoconsumers/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/rabbitmq/rabbitmqqueuenoconsumers/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A queue has ready messages but zero consumers attached, so nothing is draining it. Messages will sit indefinitely until a consumer connects.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_messages_ready{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;and&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;max&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;by&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;queue&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;(&lt;/span&gt;rabbitmq_queue_consumers{namespace&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;safetywing-&amp;lt;env&amp;gt;-infra&lt;/span&gt;&amp;#34;}&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;component&lt;/code&gt;. The &lt;code&gt;queue&lt;/code&gt; label identifies the affected queue.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Work enqueued on this queue is not being processed at all. Unlike a slow backlog, there is no progress whatsoever, so the dependent feature is effectively down. The backlog will keep growing and can eventually trip the &lt;a href="RabbitmqMemoryAlarm"&gt;memory&lt;/a&gt; or &lt;a href="RabbitmqDiskAlarm"&gt;disk&lt;/a&gt; alarms and block publishing cluster-wide.&lt;/p&gt;</description></item></channel></rss>