RabbitmqDeadLetterMessages • SafetyWing Runbooks

Meaning

A dead-letter queue (DLQ) holds one or more ready messages. DLQs are the topology chart’s {namespace}.deadletter queues — under normal operation they are empty.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra", queue=~".+[.]deadletter"}) > 0

The alert uses for: 15m, severity ticket, tier component, so a DLQ must stay non-empty for 15 minutes before it fires. The queue label identifies the affected DLQ (its name always ends in .deadletter).

Each rabbitmq-topology namespace wires a three-queue retry flow: a main queue {ns}, a retry queue {ns}.retry, and a dead-letter queue {ns}.deadletter. The main queue dead-letters failed messages to {ns}.retry, which holds them for retryDelay (default 10 min) and then re-publishes them to the main queue. A message lands in {ns}.deadletter only when it is dead-lettered with the deadletter routing key, i.e. it has exhausted its retries or was explicitly rejected as terminally unprocessable. A non-empty DLQ therefore means “messages a consumer gave up on”, not transient backpressure.

Impact

The dead-lettered messages represent work that was not processed: payments, notifications, sync events, etc., depending on the namespace. They sit in the DLQ indefinitely (the queue has no TTL) until someone replays or purges them. The business effect depends on the namespace, but the data is not lost; it is parked for manual handling. A steadily growing DLQ also consumes memory and disk like any other queue.

Diagnosis

kubectl config use-context hetzner

# Which DLQ(s) hold messages, and how many?
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
  rabbitmq-diagnostics list_queues name messages_ready | grep '\.deadletter'
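
When the broker hosts many queues, the listing can be narrowed to the DLQs that actually hold messages. The sketch below assumes the default whitespace-separated list_queues output (name first, then messages_ready); nonempty_dlqs is a hypothetical helper name, not part of any RabbitMQ tooling.

```shell
# Hypothetical helper: read `list_queues name messages_ready` output on stdin
# and print only queues ending in .deadletter with a non-zero ready count.
# Header and banner lines are skipped because they do not match the name test.
nonempty_dlqs() {
  awk '$1 ~ /\.deadletter$/ && $2 + 0 > 0 { print $1, $2 }'
}

# Usage (same placeholders as the command above):
# kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
#   rabbitmq-diagnostics list_queues name messages_ready | nonempty_dlqs
```

An empty result means every DLQ on that broker is drained.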

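Inspect a few messages without removing them to find the failure cause. With ackmode=reject_requeue_true, each fetched message is rejected and requeued, so it stays in the DLQ (flagged as redelivered). The x-death header records the original queue, the reason (rejected/expired/maxlen), and a count of how many times the message was dead-lettered from that queue for that reason: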

kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
  rabbitmqadmin get queue=<namespace>.deadletter ackmode=reject_requeue_true count=5
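
To see at a glance why messages were dead-lettered, the x-death reasons can be tallied from JSON output. This is a rough sketch: it assumes the Python rabbitmqadmin's -f raw_json output format and that each x-death entry carries a "reason" field; tally_death_reasons is a hypothetical helper. A message that passed through several queues has several x-death entries, so the counts are per entry, not per message.

```shell
# Hypothetical helper: count x-death "reason" values in JSON read from stdin.
# Counts x-death entries, not messages (one message can carry several entries).
tally_death_reasons() {
  grep -o '"reason": *"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
}

# Usage (assumes this rabbitmqadmin build supports -f raw_json):
# kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
#   rabbitmqadmin -f raw_json get queue=<namespace>.deadletter \
#     ackmode=reject_requeue_true count=5 | tally_death_reasons
```

A tally dominated by rejected points at the consumer; one dominated by expired or maxlen points at queue limits rather than message content.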

Cross-check with the originating namespace’s consumer logs around the dead-letter timestamps:

kubectl logs deployment/<consuming-service> -n safetywing-<env>-applications --tail=200 | grep -i -E "error|reject|dead"

If the DLQ is filling continuously, the upstream consumer is failing every delivery — check it the same way as RabbitmqQueueBacklog, and confirm consumers are actually attached (RabbitmqQueueNoConsumers).

Mitigation

Find the root cause from the x-death reason and the consumer logs — a poison message (bad payload), a downstream outage, or a code bug that rejects valid messages. Fix that first; replaying before the cause is fixed just re-fills the DLQ.

Replay the messages back onto the namespace’s main exchange once processing is healthy. For a one-off, a temporary Shovel from {ns}.deadletter to exchange {ns} is the safest move:

# via the management UI: Admin → Shovel Management → add a dynamic shovel
#   source queue:        <namespace>.deadletter
#   destination exchange: <namespace>
# delete the shovel once the queue drains.
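
The same shovel can also be declared from the command line as a runtime parameter. A minimal sketch, assuming the rabbitmq_shovel plugin is enabled, the queues live in the default vhost, and the pod-local URI amqp:// is acceptable; the shovel name replay-<namespace> and both function names are hypothetical. Setting src-delete-after to queue-length makes the shovel remove itself after moving the messages that were in the queue when it started. Whether the main exchange routes replayed messages back to the main queue depends on the chart's bindings; if a specific routing key is needed, the shovel's dest-exchange-key field can set it.

```shell
# Hypothetical helper: build a dynamic-shovel definition that moves
# <ns>.deadletter into exchange <ns> and deletes itself once the
# messages present at start-up have been transferred.
shovel_definition() {
  ns="$1"
  printf '{"src-protocol":"amqp091","src-uri":"amqp://","src-queue":"%s.deadletter",' "$ns"
  printf '"src-delete-after":"queue-length",'
  printf '"dest-protocol":"amqp091","dest-uri":"amqp://","dest-exchange":"%s"}\n' "$ns"
}

# Hypothetical wrapper; the Kubernetes namespace and pod are supplied by the caller.
replay_dlq() {
  k8s_ns="$1"; pod="$2"; ns="$3"
  kubectl exec -n "$k8s_ns" "$pod" -- \
    rabbitmqctl set_parameter shovel "replay-$ns" "$(shovel_definition "$ns")"
}

# Usage: replay_dlq safetywing-<env>-infra <rabbitmq-pod> <namespace>
```

Printing the definition with shovel_definition first makes it easy to review the JSON before anything touches the broker.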

Purge only messages confirmed unprocessable (and captured elsewhere if they matter):

kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
  rabbitmqctl purge_queue <namespace>.deadletter

Confirm the DLQ has drained and the alert clears; the query below should return 0 for the affected queue:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra", queue=~".+[.]deadletter"})

References

RabbitMQ dead lettering
Messages that cannot be delivered (the x-death header)
RabbitMQ Shovel plugin
rabbitmq-topology chart — the retry/dead-letter wiring