RabbitmqDiskAlarm • SafetyWing Runbooks

Meaning

A RabbitMQ node has dropped below its free-disk-space watermark and raised a disk alarm. RabbitMQ blocks publishers to avoid filling the disk and corrupting the message store.

Fires when:

max(rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"}) == 1

The alert fires once the condition has held for 5 minutes (for: 5m), with severity page and tier component.

Impact

Publishing is blocked cluster-wide. As with the memory alarm, once any node trips the free-disk watermark RabbitMQ blocks all publishing connections until free space recovers. Consumers continue, but producers hang and dependent backend services back up. If the disk fills completely the node can crash and lose durability guarantees, so this must be cleared promptly.

Diagnosis

kubectl config use-context hetzner

kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq

# Confirm the alarm and current free space vs limit
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics status

# Actual disk usage of the data volume and the PVC backing it
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- df -h /var/lib/rabbitmq
kubectl get pvc -n safetywing-<env>-infra
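
To compare the volume's free space against the limit reported by rabbitmq-diagnostics status, it can help to reduce the df output to two numbers. The following is a minimal sketch, not part of the original runbook: the function names df_free_summary and pod_disk_summary are hypothetical, and it assumes the df inside the container accepts the POSIX -P flag.

```shell
# Print "<available_KiB> <used_percent>" for the filesystem row in `df -P`
# output read from stdin. POSIX format keeps each filesystem on one line,
# so the column positions are stable.
df_free_summary() {
  awk 'NR == 2 { gsub("%", "", $5); print $4, $5 }'
}

# Hypothetical wrapper: run df inside a RabbitMQ pod and summarise it.
# Usage: pod_disk_summary <namespace> <pod>
pod_disk_summary() {
  kubectl exec -n "$1" "$2" -- df -P /var/lib/rabbitmq | df_free_summary
}
```

Running pod_disk_summary safetywing-<env>-infra <rabbitmq-pod> would print the available KiB and the usage percentage of the data volume, which can then be compared by hand with the free-disk limit shown in the status output.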

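Identify the node in alarm and which queues hold the largest backlog. The first query shows whether any node is in alarm, the second breaks the alarm down per series (one per node), and the third ranks queues by ready messages, which is a proxy for on-disk backlog and needs per-queue metrics to be exposed: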

max(rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"})
rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"}
max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"})

In the management UI (Overview → Nodes) the node shows a disk alarm badge and free-space figure against the limit.
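
The same two questions (which node is in alarm, and which queues hold the most persistent data) can also be answered from the command line. This is a sketch under assumptions rather than an established procedure: the function names are hypothetical, check_local_alarms reports any local resource alarm (memory as well as disk), and a failed kubectl exec is indistinguishable from an alarm here.

```shell
# List RabbitMQ pods whose node reports a local resource alarm.
# Usage: pods_with_local_alarm <namespace>
pods_with_local_alarm() {
  for pod in $(kubectl get pods -n "$1" -l app.kubernetes.io/component=rabbitmq -o name); do
    # check_local_alarms exits non-zero when the target node has an alarm.
    if ! kubectl exec -n "$1" "$pod" -- rabbitmq-diagnostics -q check_local_alarms >/dev/null 2>&1; then
      echo "$pod"
    fi
  done
}

# Rank tab-separated rows (name, messages, message_bytes_persistent) read
# from stdin by the third column, largest first.
# Usage: ... | top_queues_by_persistent_bytes [count]
top_queues_by_persistent_bytes() {
  sort -t "$(printf '\t')" -k3,3nr | head -n "${1:-10}"
}

# Hypothetical wrapper: list queues on a pod and rank them.
# Usage: queue_disk_ranking <namespace> <pod> [count]
queue_disk_ranking() {
  kubectl exec -n "$1" "$2" -- \
    rabbitmqctl -q list_queues --no-table-headers name messages message_bytes_persistent \
    | top_queues_by_persistent_bytes "${3:-10}"
}
```

Ranking by message_bytes_persistent rather than message count is a design choice: a queue with few large persistent messages can consume more disk than one with many small ones.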

Mitigation

Drain or purge backlog that is consuming disk. If a stuck consumer caused durable messages to accumulate, fix or restart the consuming service (apps in safetywing-<env>-applications) so messages are acked and removed:
```
kubectl rollout restart deployment/<consuming-service> -n safetywing-<env>-applications
```
For a queue with disposable backlog, purge it via the management UI or rabbitmqctl purge_queue <name>. Purging is irreversible, so confirm the messages are genuinely disposable first.
Expand the PVC. The Rook-Ceph StorageClass supports volume expansion. Increase spec.persistence.storage on the RabbitmqCluster:
```
kubectl edit rabbitmqcluster <name> -n safetywing-<env>-infra
```
```
spec:
  persistence:
    storage: 20Gi   # raise above current size
```
The operator/StatefulSet propagates the resize to the PVCs; confirm with kubectl get pvc -n safetywing-<env>-infra and re-check df -h.
If expansion does not apply immediately, the pod may need a restart before the filesystem resize takes effect; verify that the StorageClass has allowVolumeExpansion: true and that the filesystem actually grew.
Lower the disk-free watermark only as short-term relief if you cannot expand quickly, via spec.rabbitmq.additionalConfig (disk_free_limit.relative or disk_free_limit.absolute), and follow up with real capacity. Reducing it too far risks the node filling the disk entirely.
The alarm clears automatically once free space rises above the watermark; confirm rabbitmq_alarms_free_disk_space_watermark returns to 0 and publishers are unblocked.
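
The expansion and the final check above can be scripted instead of done interactively. The following is a sketch under assumptions, not the team's established tooling: the function names are hypothetical, the patch is a non-interactive equivalent of the kubectl edit step, and check_alarms is used as a cluster-wide "no alarms in effect" probe (it also fails on memory alarms).

```shell
# Build the JSON merge patch that sets spec.persistence.storage.
# Usage: storage_patch_json <size, e.g. 20Gi>
storage_patch_json() {
  printf '{"spec":{"persistence":{"storage":"%s"}}}' "$1"
}

# Apply the patch to the RabbitmqCluster resource.
# Usage: expand_storage <namespace> <cluster-name> <size>
expand_storage() {
  kubectl patch rabbitmqcluster "$2" -n "$1" --type merge -p "$(storage_patch_json "$3")"
}

# Poll until no resource alarms are in effect, or give up.
# Usage: wait_for_alarms_clear <namespace> <pod> [attempts] [sleep_seconds]
wait_for_alarms_clear() {
  attempts="${3:-30}"
  pause="${4:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # check_alarms exits zero only when no alarms are in effect in the cluster.
    if kubectl exec -n "$1" "$2" -- rabbitmq-diagnostics -q check_alarms >/dev/null 2>&1; then
      echo "alarms cleared after $i check(s)"
      return 0
    fi
    sleep "$pause"
    i=$((i + 1))
  done
  echo "alarms still active after $attempts checks" >&2
  return 1
}
```

For example, expand_storage safetywing-<env>-infra <name> 20Gi followed by wait_for_alarms_clear safetywing-<env>-infra <rabbitmq-pod> would resize the volume and then poll for up to five minutes with the default settings. Confirming the Prometheus metric has returned to 0 is still worthwhile, since the alert is evaluated on that metric.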
