RabbitmqNodeDown • SafetyWing Runbooks

Meaning#

Fewer RabbitMQ nodes are reporting metrics than the configured number of replicas, meaning one or more cluster members are down or unreachable.

Fires when:

count(rabbitmq_build_info{namespace="safetywing-<env>-infra"}) < <replicas>

The rule uses for: 5m, so the condition must hold for five minutes before the alert fires. Severity: page. Tier: component.

Impact#

A missing node reduces capacity and redundancy. With quorum queues, each lost node erodes the quorum margin; once a majority of a queue's members is down, that queue is unavailable for both publishing and consuming. Non-mirrored classic queues hosted on the down node become unavailable until it returns, while mirrored classic queues fail over to a surviving mirror. Sustained node loss risks a full cluster outage.
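
The quorum margin can be made concrete with a little arithmetic. The sketch below assumes a quorum queue with one member on every node, which may not match how the queues here are declared; quorum_margin is a hypothetical helper name, not part of any RabbitMQ tooling.

```shell
# quorum_margin TOTAL_MEMBERS MEMBERS_UP
# Prints how many more members can be lost before the majority is gone.
# A negative result means the majority is already lost.
quorum_margin() {
  total=$1
  up=$2
  majority=$(( total / 2 + 1 ))   # smallest strict majority of the members
  echo $(( up - majority ))
}
```

For a three-member queue, quorum_margin 3 3 prints 1, quorum_margin 3 2 prints 0 (no further loss is tolerated), and quorum_margin 3 1 prints -1 (the queue is unavailable).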

Diagnosis#

kubectl config use-context hetzner

# Cluster CR state and pod status
kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq -o wide
kubectl describe pod <rabbitmq-pod> -n safetywing-<env>-infra | tail -40

# Node / cluster health from inside a running pod
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics check_running
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics cluster_status
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics status

Confirm how many nodes are actually reporting:

count(rabbitmq_build_info{namespace="safetywing-<env>-infra"})

In the management UI (via kubectl rabbitmq manage <name> -n safetywing-<env>-infra), check Overview → Nodes for nodes shown as down and any quorum-queue minority warnings.
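
As a cross-check from the Kubernetes side, the number of Ready pods can be compared with the replica count in the RabbitmqCluster spec. This is a minimal sketch: rabbitmq_ready_vs_desired is a hypothetical helper, and it assumes the pod label used in the commands above selects only this cluster's pods in the namespace.

```shell
# rabbitmq_ready_vs_desired NAMESPACE CLUSTER_NAME
# Prints "<ready>/<desired>", e.g. "2/3" when one pod is not Ready.
rabbitmq_ready_vs_desired() {
  ns=$1
  name=$2
  # Desired replica count from the RabbitmqCluster custom resource.
  desired=$(kubectl get rabbitmqcluster "$name" -n "$ns" \
    -o jsonpath='{.spec.replicas}')
  # One line per pod holding the status of its Ready condition; count the "True" lines.
  ready=$(kubectl get pods -n "$ns" -l app.kubernetes.io/component=rabbitmq \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | grep -c '^True$')
  echo "${ready}/${desired}"
}

# Example: rabbitmq_ready_vs_desired "safetywing-<env>-infra" "<name>"
```

A Ready pod is not proof that the node is clustered correctly, so treat this as a quick complement to cluster_status rather than a replacement for it.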

Mitigation#

Inspect events and logs of the affected pod for the cause (OOMKill, CrashLoopBackOff, scheduling/PVC issues):

kubectl get events -n safetywing-<env>-infra --sort-by=.lastTimestamp | tail -30
kubectl logs <rabbitmq-pod> -n safetywing-<env>-infra --previous
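
To confirm an OOMKill without scrolling through the describe output, the last termination reason can be read directly from the pod status. A small sketch, with last_termination_reason as a hypothetical helper name:

```shell
# last_termination_reason NAMESPACE POD
# Prints the reason the pod's containers last terminated (for example OOMKilled or Error).
# Prints nothing if no container has terminated since the pod was created.
last_termination_reason() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Example: last_termination_reason "safetywing-<env>-infra" "<rabbitmq-pod>"
```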

If the pod is stuck or unhealthy, delete it; the StatefulSet controller recreates it under the same name and reattaches its existing PVC:
```
kubectl delete pod <rabbitmq-pod> -n safetywing-<env>-infra
```

Verify that the node rejoins the cluster and is listed among the running nodes in cluster_status:

kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics cluster_status
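
Rather than re-running cluster_status by hand, the wait can be scripted. The sketch below waits for the pod to become Ready and then uses rabbitmqctl await_online_nodes to block until the expected number of nodes is online. wait_for_rejoin is a hypothetical helper, and both timeouts are arbitrary values chosen for illustration.

```shell
# wait_for_rejoin NAMESPACE POD EXPECTED_NODE_COUNT
# Returns 0 once the pod is Ready and EXPECTED_NODE_COUNT nodes are online,
# non-zero if either wait times out.
wait_for_rejoin() {
  ns=$1
  pod=$2
  expected=$3
  # Wait for Kubernetes to report the recreated pod as Ready (assumed 5 minute limit).
  kubectl wait --for=condition=Ready "pod/$pod" -n "$ns" --timeout=300s &&
  # Then wait until the cluster reports the expected number of online nodes (assumed 120 s limit).
  kubectl exec -n "$ns" "$pod" -- rabbitmqctl await_online_nodes "$expected" --timeout 120
}

# Example: wait_for_rejoin "safetywing-<env>-infra" "<rabbitmq-pod>" 3
```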

If a node is partitioned or its data appears corrupt, first check for network partitions (rabbitmq-diagnostics cluster_status reports them) and resolve them per the configured partition-handling strategy before forcing a reset or removal of the node.
For quorum queues at risk, do not delete additional pods until the recovered node has caught up; check member health in the management UI before any rolling action.
If a PVC or node-affinity issue blocks scheduling, fix the underlying storage/node problem rather than scaling down replicas in the RabbitmqCluster spec.
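
The quorum-queue precaution above can also be checked from the command line before touching another pod. rabbitmq-diagnostics check_if_node_is_quorum_critical exits non-zero when shutting down the target node would leave some quorum queue without a majority. A minimal sketch, with safe_to_restart as a hypothetical wrapper name:

```shell
# safe_to_restart NAMESPACE POD
# Returns 0 if stopping the node in POD would not cost any quorum queue its majority.
safe_to_restart() {
  kubectl exec -n "$1" "$2" -- rabbitmq-diagnostics check_if_node_is_quorum_critical
}

# Example: delete a further pod only if the check passes.
# safe_to_restart "safetywing-<env>-infra" "<rabbitmq-pod>" && \
#   kubectl delete pod "<rabbitmq-pod>" -n "safetywing-<env>-infra"
```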
