Meaning#
Fewer RabbitMQ nodes are reporting metrics than the configured number of replicas, meaning one or more cluster members are down or unreachable.
Fires when:
count(rabbitmq_build_info{namespace="safetywing-<env>-infra"}) < <replicas>
for: 5m, severity page, tier component.
Impact#
A missing node reduces capacity and redundancy. With quorum queues, losing a node erodes the quorum margin; losing a majority of a queue's members makes that queue unavailable for reads and writes. Classic queues that are not replicated and are hosted on the down node become unavailable until it returns. Sustained node loss risks a full cluster outage.
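The quorum margin can be made concrete with a little arithmetic. A quorum queue needs a strict majority of its members online, so the sketch below computes how many further members can be lost before that majority is gone. The function name `quorum_margin` is hypothetical, and the member count is assumed to equal the number of cluster nodes, which may not hold for every queue.

```shell
#!/bin/sh
# Sketch only: quorum_margin is a hypothetical helper, not an existing tool.

# quorum_margin MEMBERS ONLINE
# Prints how many more members can be lost before the queue loses its majority.
# 0 means the queue is still available but cannot tolerate another loss;
# a negative value means the majority is already gone.
quorum_margin() (
  majority=$(( $1 / 2 + 1 ))   # strict majority of the configured members
  echo $(( $2 - majority ))
)

quorum_margin 3 3   # all three members online
quorum_margin 3 2   # one member down
quorum_margin 3 1   # two members down
```

For a three-member queue this prints 1, 0 and -1: one node down leaves no margin, and a second loss makes the queue unavailable.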
Diagnosis#
kubectl config use-context hetzner
# Cluster CR state and pod status
kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq -o wide
kubectl describe pod <rabbitmq-pod> -n safetywing-<env>-infra | tail -40
# Node / cluster health from inside a running pod
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics check_running
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics cluster_status
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics status
Confirm how many nodes are actually reporting:
count(rabbitmq_build_info{namespace="safetywing-<env>-infra"})
In the management UI (via kubectl rabbitmq manage <name> -n safetywing-<env>-infra), check Overview → Nodes for nodes shown as down and any quorum-queue minority warnings.
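The pod-level checks above can be condensed into a rough sketch that compares the replica count on the RabbitmqCluster CR with the pods that are actually Ready. The function names are hypothetical. The namespace and cluster name are passed as arguments, and the script assumes POSIX sh, awk, and the same pod label used in the commands above.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of any existing tooling.

# Reads `kubectl get pods --no-headers` output on stdin and prints the name of
# every pod whose READY column is not n/n or whose STATUS is not Running.
not_ready_pods() {
  awk 'NF >= 3 { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") print $1 }'
}

# missing_nodes EXPECTED REPORTING
# Prints how many nodes are unaccounted for (never negative).
missing_nodes() (
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then diff=0; fi
  echo "$diff"
)

# cluster_summary NAMESPACE CLUSTER
# Compares spec.replicas on the RabbitmqCluster CR with the pods that are Ready.
# Needs kubectl access; it is not called automatically.
cluster_summary() (
  expected=$(kubectl get rabbitmqcluster "$2" -n "$1" -o jsonpath='{.spec.replicas}')
  pods=$(kubectl get pods -n "$1" -l app.kubernetes.io/component=rabbitmq --no-headers)
  total=$(printf '%s\n' "$pods" | awk 'NF >= 3 { n++ } END { print n + 0 }')
  bad=$(printf '%s\n' "$pods" | not_ready_pods)
  bad_count=$(printf '%s\n' "$bad" | awk 'NF { n++ } END { print n + 0 }')
  echo "expected=$expected missing=$(missing_nodes "$expected" $(( total - bad_count )))"
  # One line per pod that is not Ready (word splitting is intentional here).
  [ -z "$bad" ] || printf 'not ready: %s\n' $bad
)
```

After sourcing the file, call `cluster_summary` with the real namespace and RabbitmqCluster name substituted for the placeholders. This only reflects pod readiness; a pod can be Ready while its metrics endpoint is not scraped, so the Prometheus count remains the authoritative check for the alert.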
Mitigation#
- Inspect events and logs of the affected pod for the cause (OOMKill, CrashLoopBackOff, scheduling/PVC issues):
  kubectl get events -n safetywing-<env>-infra --sort-by=.lastTimestamp | tail -30
  kubectl logs <rabbitmq-pod> -n safetywing-<env>-infra --previous
- If the pod is stuck or unhealthy, delete it so that it is recreated; the StatefulSet reschedules it with its PVC:
  kubectl delete pod <rabbitmq-pod> -n safetywing-<env>-infra
- Verify the node rejoins via peer discovery and appears among the running nodes in cluster_status:
  kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics cluster_status
- If you suspect a partition or corrupt data, check for network partitions (rabbitmq-diagnostics cluster_status reports them) and resolve per the partition-handling strategy before forcing anything.
- For quorum queues at risk, do not delete additional pods until the recovered node has caught up; check member health in the management UI before any rolling action.
- If a PVC or node-affinity issue blocks scheduling, fix the underlying storage/node problem rather than scaling down replicas in the RabbitmqCluster spec.
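The "verify, then wait before touching another pod" steps above can be sketched as two small guards. The function names are hypothetical. The sketch assumes the RabbitMQ image ships the `rabbitmq-queues check_if_node_is_quorum_critical` health check, which exits non-zero when stopping the target node would leave some quorum queue without a majority.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of any existing tooling.

# wait_for_rejoin NAMESPACE POD [TIMEOUT]
# Waits for the recreated pod to become Ready, then confirms the RabbitMQ
# application is running on it. If `kubectl wait` reports NotFound right after
# a delete, the StatefulSet has not recreated the pod yet; retry shortly.
wait_for_rejoin() {
  kubectl wait --for=condition=Ready "pod/$2" -n "$1" --timeout="${3:-300s}" &&
    kubectl exec -n "$1" "$2" -- rabbitmq-diagnostics check_running
}

# safe_to_delete NAMESPACE POD
# Succeeds only if stopping POD would not cost any quorum queue its majority.
safe_to_delete() {
  kubectl exec -n "$1" "$2" -- rabbitmq-queues check_if_node_is_quorum_critical
}
```

A cautious sequence is to run `wait_for_rejoin` for the pod that was just recreated, then `safe_to_delete` for the next candidate, and stop if either returns non-zero. This complements the management UI check and does not replace it.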