RabbitMQ on SafetyWing Runbooks

RabbitmqDeadLetterMessages

Meaning

A dead-letter queue (DLQ) holds one or more ready messages. DLQs are the topology chart’s {namespace}.deadletter queues; under normal operation they are empty.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra", queue=~".+[.]deadletter"}) > 0

for: 15m, severity ticket, tier component. The queue label identifies the affected DLQ (always ends in .deadletter).
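The matcher in this expression can be sketched in Python to show which queues it selects. PromQL regex matchers are fully anchored, so the pattern must match the whole queue label. The queue names and sample values below are hypothetical, and the sketch does not model the `for: 15m` hold period.

```python
import re

# PromQL anchors regex matchers on both ends, so fullmatch() is the equivalent.
DLQ_PATTERN = re.compile(r".+[.]deadletter")


def firing_dlqs(samples):
    """Return sorted DLQ names whose highest ready count is above zero.

    `samples` is an iterable of (queue, messages_ready) pairs, one per series.
    """
    worst = {}
    for queue, ready in samples:
        if DLQ_PATTERN.fullmatch(queue):
            # Mirrors `max by (queue)`: keep the largest value per queue.
            worst[queue] = max(worst.get(queue, 0), ready)
    return sorted(q for q, ready in worst.items() if ready > 0)


# Hypothetical sample data, not taken from a real cluster.
samples = [
    ("orders", 120),                   # main queue: not a DLQ
    ("orders.retry", 4),               # retry queue: not a DLQ
    ("orders.deadletter", 3),          # DLQ with messages: fires
    ("billing.deadletter", 0),         # empty DLQ: does not fire
    ("orders.deadletter.archive", 9),  # does not end in .deadletter
]
print(firing_dlqs(samples))
```

Only queues whose full name ends in .deadletter and that hold at least one ready message are returned.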

Each rabbitmq-topology namespace wires a three-queue retry flow: the main queue {ns} dead-letters failed messages to {ns}.retry, which holds them for retryDelay (default 10 min) and then re-publishes them to the main queue. A message lands in the dead-letter queue {ns}.deadletter only when it is dead-lettered with the deadletter routing key, that is, when it has exhausted its retries or was explicitly rejected as terminally unprocessable. So a non-empty DLQ means “messages a consumer gave up on”, not transient backpressure.
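The routing decision in this flow can be sketched as a small pure function. This is an illustration under assumptions, not the chart's or any consumer's actual implementation: it assumes retries are counted from the `x-death` header that RabbitMQ attaches to dead-lettered messages, and `max_retries`, `terminal`, and the function names are hypothetical.

```python
def death_count(headers, queue, reason="rejected"):
    """Sum the x-death counts recorded for `queue` with the given reason."""
    deaths = (headers or {}).get("x-death") or []
    return sum(
        int(entry.get("count", 0))
        for entry in deaths
        if entry.get("queue") == queue and entry.get("reason") == reason
    )


def next_queue(ns, headers, max_retries=3, terminal=False):
    """Pick where a failed message from main queue `ns` should go next.

    max_retries and terminal are hypothetical knobs for this sketch.
    """
    if terminal or death_count(headers, ns) >= max_retries:
        return f"{ns}.deadletter"  # consumer gives up on the message
    return f"{ns}.retry"           # held for retryDelay, then re-published


# Hypothetical headers for a message already rejected three times from "orders".
exhausted = {
    "x-death": [
        {"queue": "orders", "reason": "rejected", "count": 3},
        {"queue": "orders.retry", "reason": "expired", "count": 3},
    ]
}
print(next_queue("orders", {}))                # first failure
print(next_queue("orders", exhausted))         # retries exhausted
print(next_queue("orders", {}, terminal=True)) # terminally unprocessable
```

Only rejections from the main queue are counted; expirations from the retry queue are recorded under a separate `x-death` entry and ignored here.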

RabbitmqDiskAlarm

Meaning

Free disk space on a RabbitMQ node has dropped below the configured free-disk-space watermark, and the node has raised a disk alarm. RabbitMQ blocks publishers to avoid filling the disk and corrupting the message store.

Fires when:

max(rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"}) == 1

for: 5m, severity page, tier component.

Impact

Publishing is blocked cluster-wide. As with the memory alarm, once any node trips the free-disk watermark RabbitMQ blocks all publishing connections until free space recovers. Consumers continue, but producers hang and dependent backend services back up. If the disk fills completely the node can crash and lose durability guarantees, so this must be cleared promptly.
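To confirm which alarm is set on a node, the alarm gauges can be read from Prometheus text-format output. The sketch below parses such output for the two alarm metrics named in the alert expressions on this page. The sample text is hypothetical, and the parser assumes samples carry no trailing timestamp.

```python
# Alarm gauges used by the alert expressions, mapped to short names.
ALARM_METRICS = {
    "rabbitmq_alarms_free_disk_space_watermark": "disk",
    "rabbitmq_alarms_memory_used_watermark": "memory",
}


def active_alarms(metrics_text):
    """Return the set of alarm names whose gauge equals 1."""
    active = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Assumes "<name>{labels} <value>" with no timestamp after the value.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        alarm = ALARM_METRICS.get(name)
        if alarm and float(value) == 1.0:
            active.add(alarm)
    return active


# Hypothetical scrape output for one node.
sample = """\
# TYPE rabbitmq_alarms_free_disk_space_watermark gauge
rabbitmq_alarms_free_disk_space_watermark 1
# TYPE rabbitmq_alarms_memory_used_watermark gauge
rabbitmq_alarms_memory_used_watermark 0
"""
print(active_alarms(sample))
```

Because a single node in alarm blocks publishing for the whole cluster, the check would need to run against every node, not just one.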

RabbitmqMemoryAlarm

Meaning

A RabbitMQ node has crossed its memory high-watermark and raised a memory alarm. RabbitMQ responds by blocking all publishers across the cluster to protect the broker from running out of memory.

Fires when:

max(rabbitmq_alarms_memory_used_watermark{namespace="safetywing-<env>-infra"}) == 1

for: 5m, severity page, tier component.

Impact

Publishing is blocked cluster-wide. Once any node hits the memory watermark, RabbitMQ blocks every connection that publishes, so producers across every queue hang. Consumers keep draining, but new messages cannot be accepted. This typically surfaces as backend services timing out on publish, with request latency growing until memory is reclaimed.

RabbitmqNodeDown

Meaning

Fewer RabbitMQ nodes are reporting metrics than the configured number of replicas, meaning one or more cluster members are down or unreachable.

Fires when:

count(rabbitmq_build_info{namespace="safetywing-<env>-infra"}) < <replicas>

for: 5m, severity page, tier component.

Impact

A missing node reduces capacity and redundancy. With quorum queues, losing a node erodes the quorum margin; losing a majority makes those queues unavailable for reads and writes. Classic queues that are not replicated and are hosted on the down node become unavailable until it returns. Sustained node loss risks a full cluster outage.
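The quorum margin is simple majority arithmetic, sketched below. The function name and the member counts in the example are illustrative; the actual replica count for an environment is whatever `<replicas>` is set to.

```python
def quorum_status(members, up):
    """Describe quorum-queue availability for `members` replicas with `up` online."""
    if not 0 <= up <= members:
        raise ValueError("up must be between 0 and members")
    majority = members // 2 + 1  # smallest count that is more than half
    return {
        "majority": majority,
        "available": up >= majority,
        # How many more nodes can fail before the queue loses its majority.
        "further_losses_tolerated": max(up - majority, 0),
    }


print(quorum_status(3, 3))  # healthy: one loss tolerated
print(quorum_status(3, 2))  # still available, but no margin left
print(quorum_status(3, 1))  # majority lost: unavailable
```

With three members, the first node loss leaves the queue available but with zero margin, which is why a single missing node already pages.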

RabbitmqQueueBacklog

Meaning

A queue has accumulated a large number of ready (undelivered) messages, meaning consumers are not keeping up with producers.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}) > <threshold>

for: 15m, severity ticket, tier component. The queue label identifies the affected queue.

Impact

Messages are being produced faster than they are consumed. Downstream processing is delayed, so whatever the queue feeds (notifications, payments, sync jobs, etc.) lags behind. A persistently growing backlog also consumes memory and disk and can eventually trip the memory or disk alarms and block publishing cluster-wide.
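A rough way to judge how serious a backlog is: estimate the drain time from the backlog size and the publish and consume rates. The sketch below assumes both rates stay constant, which is rarely true in practice, and the numbers are hypothetical.

```python
import math


def drain_estimate_seconds(backlog, publish_rate, consume_rate):
    """Seconds until the backlog reaches zero at constant rates.

    Rates are in messages per second. Returns math.inf when consumers
    are not outpacing producers, meaning the backlog never drains.
    """
    if backlog <= 0:
        return 0.0
    net = consume_rate - publish_rate
    if net <= 0:
        return math.inf
    return backlog / net


# 6000 ready messages, consumers 10 msg/s faster than producers.
print(drain_estimate_seconds(6000, publish_rate=40, consume_rate=50))
# Producers faster than consumers: the backlog keeps growing.
print(drain_estimate_seconds(6000, publish_rate=50, consume_rate=40))
```

An infinite estimate corresponds to the persistently growing case described above, where the backlog can eventually trip the memory or disk alarms.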

RabbitmqQueueNoConsumers

Meaning

A queue has ready messages but zero consumers attached, so nothing is draining it. Messages will sit indefinitely until a consumer connects.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}) > 0
and
max by (queue) (rabbitmq_queue_consumers{namespace="safetywing-<env>-infra"}) == 0

for: 15m, severity ticket, tier component. The queue label identifies the affected queue.
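The `and` in this expression keeps a queue only when both sides return a series with the same queue label. A minimal Python model of that matching is below; the queue names and values are hypothetical, and the `for: 15m` hold period is not modeled.

```python
def no_consumer_queues(ready, consumers):
    """Return sorted queues that have ready messages and zero consumers.

    Both arguments map queue name -> max value across series, mirroring
    the two `max by (queue)` aggregations.
    """
    lhs = {q for q, value in ready.items() if value > 0}       # ... > 0
    rhs = {q for q, value in consumers.items() if value == 0}  # ... == 0
    # PromQL `and` keeps left-hand series that have a match on the right.
    return sorted(lhs & rhs)


# Hypothetical sample values.
ready = {"orders": 10, "emails": 0, "sync": 5, "orphan": 7}
consumers = {"orders": 0, "emails": 0, "sync": 2}
print(no_consumer_queues(ready, consumers))
```

Note that "orphan" does not fire in this model: it has ready messages but no consumer series at all, so there is nothing on the right-hand side for `and` to match.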

Impact

Work enqueued on this queue is not being processed at all. Unlike a slow backlog, there is no progress whatsoever, so the dependent feature is effectively down. The backlog will keep growing and can eventually trip the memory or disk alarms and block publishing cluster-wide.