RabbitmqQueueBacklog • SafetyWing Runbooks

Meaning#

A queue has accumulated a large number of ready (undelivered) messages, meaning consumers are not keeping up with producers.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}) > <threshold>

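The rule uses for: 15m, so the threshold must be exceeded continuously for 15 minutes before the alert fires. Severity is ticket and tier is component. The queue label identifies the affected queue.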

Impact#

Messages are being produced faster than they are consumed. Downstream processing is delayed, so whatever the queue feeds (notifications, payments, sync jobs, etc.) lags behind. A persistently growing backlog also consumes memory and disk and can eventually trip the memory or disk alarms and block publishing cluster-wide.

Diagnosis#

kubectl config use-context hetzner

kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq

# Per-queue ready depth, unacknowledged count, consumer count, and consumer utilisation
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
  rabbitmq-diagnostics list_queues name messages_ready messages_unacknowledged consumers consumer_utilisation
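
On a broker with many queues it can help to narrow the listing to the queues that are actually backed up. The sketch below is a hypothetical helper (`backlogged_queues` is not part of any RabbitMQ tooling) that reads `list_queues` output on stdin and prints only the queues whose `messages_ready` exceeds a threshold. It assumes tab-separated columns in the order requested above, and the sample values are made up for illustration.

```shell
# Hypothetical helper: print queues whose messages_ready exceeds a threshold.
# Reads list_queues output on stdin; assumes tab-separated columns in the order
# name, messages_ready, messages_unacknowledged, consumers, consumer_utilisation.
# Header and banner lines are skipped because their second column is not numeric.
backlogged_queues() {
  threshold="$1"
  awk -F '\t' -v t="$threshold" '
    $2 ~ /^[0-9]+$/ && $2 + 0 > t + 0 {
      printf "%s ready=%s unacked=%s consumers=%s\n", $1, $2, $3, $4
    }'
}

# Example with captured sample output (illustrative values only).
printf 'name\tmessages_ready\tmessages_unacknowledged\tconsumers\tconsumer_utilisation\norders\t12000\t50\t2\t0.4\nemails\t3\t0\t1\t1.0\n' \
  | backlogged_queues 1000
```

In practice the input would come from piping the `kubectl exec ... list_queues ...` command above into `backlogged_queues <threshold>`.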

Confirm which queue is backed up and whether it is still growing:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"})
# rate of change over 15m — positive means the backlog is growing
deriv(rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}[15m])
max by (queue) (rabbitmq_queue_consumers{namespace="safetywing-<env>-infra"})

In the management UI, open the affected queue (Queues → <queue>) and compare the incoming rate against the deliver/ack rates, then check the number of attached consumers. If there are zero consumers, see RabbitmqQueueNoConsumers.

Mitigation#

Identify the consuming service for the queue and verify it is running and healthy:

kubectl get pods -n safetywing-<env>-applications | grep <consuming-service>
kubectl logs deployment/<consuming-service> -n safetywing-<env>-applications --tail=100

Look for processing errors or slow downstreams (DB, external API) in the consumer logs that are slowing acks; resolve the bottleneck.

Scale the consumer to add parallelism if it is healthy but simply outpaced:

kubectl scale deployment/<consuming-service> -n safetywing-<env>-applications --replicas=<n>
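
One rough way to pick `<n>` is to divide the queue's incoming rate by the ack rate a single replica sustains, then add headroom so the backlog drains rather than just holding steady. The helper below is a hypothetical sketch, not an established sizing rule: `replicas_needed` and its 20% default headroom are assumptions. It also assumes throughput scales roughly linearly with replicas, which will not hold if a shared downstream (DB, external API) is the bottleneck.

```shell
# Hypothetical sizing helper: replicas needed to keep up with the publish rate
# and still drain, assuming throughput scales roughly linearly with replicas.
# Args: publish rate (msg/s), per-replica ack rate (msg/s), headroom percent (default 20).
replicas_needed() {
  awk -v p="$1" -v c="$2" -v h="${3:-20}" 'BEGIN {
    if (c <= 0) { print "per-replica rate must be > 0" > "/dev/stderr"; exit 1 }
    need = p * (1 + h / 100) / c
    n = int(need); if (n < need) n++     # round up to a whole replica
    if (n < 1) n = 1                     # never suggest fewer than one replica
    print n
  }'
}

replicas_needed 450 100 20   # 450 * 1.2 / 100 = 5.4, rounds up to 6
```

Both rates can be read from the management UI queue page described under Diagnosis.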

Restart a stuck consumer if it is connected but not acking (low consumer_utilisation):

kubectl rollout restart deployment/<consuming-service> -n safetywing-<env>-applications

Check for poison messages that are repeatedly redelivered or rejected; route them to a dead-letter queue, or purge them if they are confirmed unprocessable.

Watch the backlog drain via the PromQL queries and management UI rates above, and keep an eye on node memory and disk while it clears.
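
To judge whether the backlog will clear in a reasonable time, the remaining depth can be divided by the net drain rate (ack rate minus publish rate). The following is a minimal sketch; `drain_eta_seconds` is a hypothetical helper, and the estimate assumes both rates stay constant.

```shell
# Hypothetical helper: estimate seconds until the backlog is empty.
# Args: messages_ready, ack rate (msg/s), publish rate (msg/s).
drain_eta_seconds() {
  awk -v d="$1" -v a="$2" -v p="$3" 'BEGIN {
    net = a - p                      # net drain rate in msg/s
    if (net <= 0) { print "not draining"; exit }
    printf "%d\n", d / net           # truncated to whole seconds
  }'
}

drain_eta_seconds 12000 150 100   # 12000 / (150 - 100) = 240
```

If the helper reports "not draining", consumers are still not keeping up and the earlier mitigation steps need another pass.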
