Meaning#
A RabbitMQ node has dropped below its free-disk-space watermark and raised a disk alarm. RabbitMQ blocks publishers to avoid filling the disk and corrupting the message store.
Fires when:
max(rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"}) == 1
for: 5m, severity page, tier component.
Impact#
Publishing is blocked cluster-wide. As with the memory alarm, once any node trips the free-disk watermark RabbitMQ blocks all publishing connections until free space recovers. Consumers continue, but producers hang and dependent backend services back up. If the disk fills completely the node can crash and lose durability guarantees, so this must be cleared promptly.
Diagnosis#
kubectl config use-context hetzner
kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq
# Confirm the alarm and current free space vs limit
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics status
# Actual disk usage of the data volume and the PVC backing it
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- df -h /var/lib/rabbitmq
kubectl get pvc -n safetywing-<env>-infra
Identify the node in alarm and which queues hold the most messages on disk:
max(rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"})
rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"}
max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"})
In the management UI (Overview → Nodes) the node shows a disk alarm badge and free-space figure against the limit.
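The ready-message count does not say how many bytes a queue occupies on disk. As a rough complement to the queries above, the sketch below ranks queues by persistent message bytes using `rabbitmqctl list_queues`. The function names `top_queues_by_disk` and `rank_queues_on_pod` are hypothetical helpers, not part of any existing tooling, and the sketch assumes the queues of interest live in the default vhost and that the output is tab-separated.

```shell
#!/usr/bin/env bash
# Sketch: rank queues by persistent (on-disk) message bytes.

# Pure helper: reads tab-separated rows of
#   name<TAB>messages<TAB>message_bytes_persistent
# on stdin and prints the N largest by the third column (default 10).
top_queues_by_disk() {
  local limit="${1:-10}"
  sort -t $'\t' -k3,3nr | head -n "$limit"
}

# Hypothetical wrapper: runs list_queues inside one pod and ranks the result.
# Arguments: <namespace> <rabbitmq-pod> [limit]
# Assumes the default vhost; pass --vhost to rabbitmqctl for others.
rank_queues_on_pod() {
  local ns="$1" pod="$2" limit="${3:-10}"
  kubectl exec -n "$ns" "$pod" -- \
    rabbitmqctl list_queues --quiet --no-table-headers \
      name messages message_bytes_persistent \
    | top_queues_by_disk "$limit"
}
```

For example, `rank_queues_on_pod safetywing-<env>-infra <rabbitmq-pod> 5` would print the five queues holding the most persistent bytes on that node.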
Mitigation#
- Drain or purge backlog that is consuming disk. If a stuck consumer caused durable messages to accumulate, fix/restart the consuming service (apps in safetywing-<env>-applications) so messages are acked and removed:
    kubectl rollout restart deployment/<consuming-service> -n safetywing-<env>-applications
  For a queue with disposable backlog, purge it via the management UI or rabbitmqctl purge_queue <name>.
- Expand the PVC. The Rook-Ceph StorageClass supports volume expansion. Increase spec.persistence.storage on the RabbitmqCluster:
    kubectl edit rabbitmqcluster <name> -n safetywing-<env>-infra
    spec:
      persistence:
        storage: 20Gi # raise above current size
  The operator/StatefulSet propagates the resize to the PVCs; confirm with kubectl get pvc -n safetywing-<env>-infra and re-check df -h.
- If expansion does not apply immediately, the volume may need the pod to restart; verify the StorageClass has allowVolumeExpansion: true and the filesystem grew.
- Lower the disk-free watermark only as a short-term relief if you cannot expand quickly, via spec.rabbitmq.additionalConfig (disk_free_limit.relative or absolute), but follow up with real capacity. Reducing it too far risks the node filling the disk entirely.
- The alarm clears automatically once free space rises above the watermark; confirm rabbitmq_alarms_free_disk_space_watermark returns to 0 and publishers are unblocked.
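During an incident it can be easier to raise the storage request non-interactively than to open an editor. The sketch below is one possible way to do that with a JSON merge patch on spec.persistence.storage. The helper names `storage_patch` and `expand_rabbitmq_storage` are hypothetical, and the size check is an assumption of this sketch that deliberately accepts only Mi/Gi/Ti quantities.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive alternative to `kubectl edit` for raising storage.

# Pure helper: builds the JSON merge patch for spec.persistence.storage.
storage_patch() {
  local size="$1"
  # Accept only binary quantities such as 500Mi, 20Gi or 1Ti (sketch-level check).
  if [[ ! "$size" =~ ^[0-9]+(Mi|Gi|Ti)$ ]]; then
    echo "invalid size: $size" >&2
    return 1
  fi
  printf '{"spec":{"persistence":{"storage":"%s"}}}' "$size"
}

# Hypothetical wrapper: patches the RabbitmqCluster, then lists the PVCs
# so the new capacity can be confirmed.
# Arguments: <namespace> <cluster-name> <new-size>
expand_rabbitmq_storage() {
  local ns="$1" name="$2" size="$3" patch
  patch="$(storage_patch "$size")" || return 1
  kubectl patch rabbitmqcluster "$name" -n "$ns" --type merge -p "$patch"
  kubectl get pvc -n "$ns"
}
```

A call such as `expand_rabbitmq_storage safetywing-<env>-infra <name> 20Gi` would apply the same change as the manual edit; the new size must be larger than the current one, since PVCs cannot be shrunk.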