Meaning#
A queue has accumulated a large number of ready (undelivered) messages, meaning consumers are not keeping up with producers.
Fires when:
max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}) > <threshold>
for: 15m, severity ticket, tier component. The queue label identifies the affected queue.
Impact#
Messages are being produced faster than they are consumed. Downstream processing is delayed, so whatever the queue feeds (notifications, payments, sync jobs, etc.) lags behind. A persistently growing backlog also consumes memory and disk and can eventually trip the memory or disk alarms and block publishing cluster-wide.
Diagnosis#
kubectl config use-context hetzner
kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq
# Per-queue depth, consumer count, and throughput
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
  rabbitmq-diagnostics list_queues name messages_ready messages_unacknowledged consumers consumer_utilisation

Confirm which queue is backed up and whether it is still growing:
max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"})
# rate of change over 15m — positive means the backlog is growing
deriv(rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}[15m])
max by (queue) (rabbitmq_queue_consumers{namespace="safetywing-<env>-infra"})

In the management UI, open the queue (Queues →
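To triage the per-queue listing quickly, the output of the list_queues command above can be piped through a small filter. The following is a minimal sketch, not part of the original runbook: the function name `triage_queues`, the default cutoff of 1000 ready messages, and the 0.5 utilisation cutoff are all illustrative assumptions. It assumes tab-separated columns in the same order as the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the runbook): classify backed-up queues.
# Reads list_queues output on stdin with tab-separated columns in this order:
#   name  messages_ready  messages_unacknowledged  consumers  consumer_utilisation
# $1 = ready-message cutoff (assumed default 1000, not the alert's real threshold)
# $2 = consumer_utilisation cutoff (assumed default 0.5)
triage_queues() {
  local threshold="${1:-1000}"
  local min_util="${2:-0.5}"
  awk -F'\t' -v t="$threshold" -v u="$min_util" '
    # Skip banner and header lines: the second field must be a whole number.
    $2 !~ /^[0-9]+$/ { next }
    # Ignore queues at or below the cutoff.
    $2 <= t { next }
    # No consumers attached at all.
    $4 == 0 { printf "%s\tno-consumers\n", $1; next }
    # Consumers attached, but utilisation is reported and low.
    $5 != "" && ($5 + 0) < u { printf "%s\tlow-utilisation\n", $1; next }
    # Consumers attached and utilisation not low, yet a backlog remains.
    { printf "%s\tbacklogged\n", $1 }
  '
}

# Example use (placeholders as in the commands above):
#   kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
#     rabbitmq-diagnostics list_queues name messages_ready \
#     messages_unacknowledged consumers consumer_utilisation | triage_queues 1000
```

A queue flagged no-consumers points at a consumer that is down or disconnected; low-utilisation matches the stuck-consumer case; backlogged suggests a consumer that is working but outpaced.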
Mitigation#
- Identify the consuming service for the queue and verify it is running and healthy:
  kubectl get pods -n safetywing-<env>-applications | grep <consuming-service>
  kubectl logs deployment/<consuming-service> -n safetywing-<env>-applications --tail=100
- Look for processing errors or slow downstreams (DB, external API) in the consumer logs that are slowing acks; resolve the bottleneck.
- Scale the consumer to add parallelism if it is healthy but simply outpaced:
  kubectl scale deployment/<consuming-service> -n safetywing-<env>-applications --replicas=<n>
- Restart a stuck consumer if it is connected but not acking (low consumer_utilisation):
  kubectl rollout restart deployment/<consuming-service> -n safetywing-<env>-applications
- Check for poison messages repeatedly redelivered/rejected; route them to a dead-letter queue or purge if they are confirmed unprocessable.
- Watch the backlog drain via the PromQL/management-UI rates above, and keep an eye on node memory/disk while it clears.
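When deciding how far to scale, a rough drain-time estimate helps: the backlog clears in approximately backlog / (consume rate - publish rate) seconds, and never clears if consumers are not faster than producers. The sketch below only does that arithmetic. The function name `estimate_drain_minutes` is hypothetical, and the rates are assumed to be read by hand from the management UI, in messages per second.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the runbook): estimate minutes until a backlog drains.
# Usage: estimate_drain_minutes BACKLOG CONSUME_RATE PUBLISH_RATE
#   BACKLOG       ready messages currently in the queue
#   CONSUME_RATE  messages per second acknowledged by consumers
#   PUBLISH_RATE  messages per second published into the queue
# Prints whole minutes, rounded up, or "never" if consumers are not outpacing producers.
estimate_drain_minutes() {
  awk -v b="$1" -v c="$2" -v p="$3" 'BEGIN {
    net = c - p                      # net drain rate, messages per second
    if (net <= 0) { print "never"; exit }
    secs = b / net
    mins = int(secs / 60)
    if (mins * 60 < secs) mins++     # round up partial minutes
    print mins
  }'
}

# Example: 120000 ready messages, consuming 250/s, publishing 50/s
estimate_drain_minutes 120000 250 50
```

If the estimate is "never" or far longer than acceptable, that is a signal to add replicas or fix the consumer bottleneck before waiting on the backlog.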