RabbitmqMemoryAlarm • SafetyWing Runbooks

Meaning

A RabbitMQ node has crossed its memory high-watermark and raised a memory alarm. RabbitMQ responds by blocking all publishers across the cluster to protect the broker from running out of memory.

Fires when:

```
max(rabbitmq_alarms_memory_used_watermark{namespace="safetywing-<env>-infra"}) == 1
```

The condition must hold for 5 minutes (for: 5m). Severity: page; tier: component.

Impact

Publishing is blocked cluster-wide. Once any node hits the memory watermark, RabbitMQ throttles/blocks all connections that are publishing, so producers across every queue hang. Consumers keep draining, but new messages cannot be accepted. This typically surfaces as backend services timing out on publish and growing request latency until memory is reclaimed.
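To see the blocking directly, connection states can be tallied from the CLI. The helper below is a minimal sketch; the function name count_connection_states is hypothetical, and it assumes tab-separated output from rabbitmqctl list_connections with the state as the last column. Connections that have published and are now paused report state blocked; connections that have not tried to publish yet report blocking.

```shell
# Tally connections by state.
# Expects tab-separated lines on stdin with the state in the last column,
# e.g. the output of: rabbitmqctl -q list_connections --no-table-headers name state
count_connection_states() {
  awk -F'\t' 'NF >= 2 { n[$NF]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage against a live pod (placeholders as in the rest of this runbook):
#   kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
#     rabbitmqctl -q list_connections --no-table-headers name state \
#     | count_connection_states
```

A non-zero blocked count while the alarm is active confirms that producers are being held back by the broker rather than failing for an unrelated reason.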

Diagnosis

```
kubectl config use-context hetzner

kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq

# Confirm the alarm and see where memory is going
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics status
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics memory_breakdown
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics list_queues name messages messages_ready messages_unacknowledged memory
```
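On a cluster with many queues the raw queue listing is hard to scan. The following is a small sketch for ranking queues by memory; the helper name top_queues_by_memory is hypothetical, and it assumes tab-separated input with the queue name in column 1 and memory (bytes) in column 2.

```shell
# Print the N queues using the most memory (default 5).
# Expects tab-separated lines on stdin: name, memory, messages
# e.g. the output of: rabbitmqctl -q list_queues --no-table-headers name memory messages
top_queues_by_memory() {
  sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-5}"
}

# Usage against a live pod (placeholders as in the rest of this runbook):
#   kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
#     rabbitmqctl -q list_queues --no-table-headers name memory messages \
#     | top_queues_by_memory 10
```

The -q and --no-table-headers flags suppress informational lines and the column header so that only data rows reach the sort.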

Check whether any node is in alarm (first query) and which node it is (second query, one series per node; a value of 1 means that node is in alarm):

```
max(rabbitmq_alarms_memory_used_watermark{namespace="safetywing-<env>-infra"})
rabbitmq_alarms_memory_used_watermark{namespace="safetywing-<env>-infra"}
```

In the management UI (Overview → Nodes) the node in alarm shows a red memory bar and a “memory alarm” badge; its memory details view breaks usage down by queues, connections, binaries, and so on.
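If the management UI is slow or unreachable during the incident, connection and channel counts per client can be approximated from the CLI. This is a sketch only; the helper name connections_per_client is hypothetical, and it assumes tab-separated input with the columns user, peer_host, channels.

```shell
# Summarise connections and channels per (user, peer_host), busiest first.
# Expects tab-separated lines on stdin: user, peer_host, channels
# e.g. the output of: rabbitmqctl -q list_connections --no-table-headers user peer_host channels
# Output columns: connections, channels, user, peer_host
connections_per_client() {
  awk -F'\t' '
    NF >= 3 { key = $1 "\t" $2; conns[key]++; chans[key] += $3 }
    END { for (k in conns) printf "%d\t%d\t%s\n", conns[k], chans[k], k }
  ' | sort -k1,1nr
}

# Usage against a live pod (placeholders as in the rest of this runbook):
#   kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
#     rabbitmqctl -q list_connections --no-table-headers user peer_host channels \
#     | connections_per_client | head
```

A single user or host with a connection or channel count far above the rest is a likely candidate for a leaking client.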

Mitigation

Find what is consuming memory. Use memory_breakdown and the queue list above. The usual cause is a queue with a large backlog of unacked/ready messages, or a runaway number of connections/channels.
Relieve the backlog. If a stuck or slow consumer is causing messages to pile up, fix or restart the consuming service (apps in safetywing-<env>-applications) so messages drain and memory frees:
```
kubectl get pods -n safetywing-<env>-applications | grep <consuming-service>
kubectl rollout restart deployment/<consuming-service> -n safetywing-<env>-applications
```
Close abusive connections if connection/channel count is the driver (visible in the management UI Connections tab); restart the misbehaving client.
Give the node more memory or raise the watermark by editing the RabbitmqCluster CR — raise the container memory limit (the operator derives the absolute watermark from it) and/or set the relative watermark:
```
kubectl edit rabbitmqcluster <name> -n safetywing-<env>-infra
```
```
spec:
  resources:
    limits:
      memory: 4Gi      # bump the limit; watermark scales with it
  rabbitmq:
    additionalConfig: |
      vm_memory_high_watermark.relative = 0.6
```
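The operator performs a rolling update. Prefer raising the limit over loosening the watermark: a higher relative watermark leaves less headroom between the alarm firing and the container being OOM-killed.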
Scale out by increasing spec.replicas if the cluster is genuinely undersized for sustained throughput.
The alarm clears automatically once memory drops back below the watermark; confirm publishers are unblocked via the management UI and that rabbitmq_alarms_memory_used_watermark returns to 0.
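Recovery can also be confirmed from the CLI by polling a health check until it passes. The sketch below uses a hypothetical generic retry helper, wait_until; the suggested check is rabbitmq-diagnostics check_alarms, which exits non-zero while any alarm is in effect on any cluster node. The attempt count and delay are arbitrary illustrative values.

```shell
# Retry a command until it succeeds or the attempts run out.
#   $1 = max attempts, $2 = seconds between attempts, rest = command to run
# Returns 0 as soon as the command succeeds, 1 if it never does.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is coming.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage against a live pod (placeholders as in the rest of this runbook):
#   wait_until 30 10 kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- \
#     rabbitmq-diagnostics -q check_alarms \
#     && echo "alarms cleared" || echo "alarm still active"
```

Because the check covers all alarms, a disk alarm would also keep it failing; cross-check against rabbitmq_alarms_memory_used_watermark if the loop times out.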
