Meaning#

Fewer RabbitMQ nodes are reporting metrics than the configured number of replicas, meaning one or more cluster members are down, unreachable, or not being scraped.

Fires when:

count(rabbitmq_build_info{namespace="safetywing-<env>-infra"}) < <replicas>

The condition must hold for 5 minutes (for: 5m) before the alert fires. Severity: page. Tier: component.

Impact#

A missing node reduces capacity and redundancy. With quorum queues, losing a node erodes the quorum margin; losing a majority of a queue's members makes that queue unavailable for reads and writes. Classic queues without mirrors that are hosted on the down node become unavailable until it returns; mirrored classic queues fail over to a surviving mirror, if one exists. Sustained node loss risks a full cluster outage.
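The quorum margin mentioned above comes down to simple arithmetic. The sketch below assumes each quorum queue has one member on every node, which need not hold in a given cluster; the function names are illustrative only.

```shell
# Majority needed for a quorum queue with N members: floor(N/2) + 1.
# $1 = total members.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many more members can be lost before the queue loses its majority.
# A negative result means the majority is already lost.
# $1 = total members, $2 = members currently up.
remaining_margin() {
  echo $(( $2 - ($1 / 2 + 1) ))
}
```

For a 3-member queue with one node down, `remaining_margin 3 2` prints 0: the queue still works, but one further loss makes it unavailable.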

Diagnosis#

kubectl config use-context hetzner

# Cluster CR state and pod status
kubectl get rabbitmqcluster -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/component=rabbitmq -o wide
kubectl describe pod <rabbitmq-pod> -n safetywing-<env>-infra | tail -40

# Node / cluster health from inside a running pod
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics check_running
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics cluster_status
kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics status
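To pick out the affected pod quickly, the pod listing above can be filtered. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS).

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the names of
# pods that are not Running or whose containers are not all ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column, e.g. "0/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example (same namespace and label selector as the commands above):
# kubectl get pods -n "safetywing-<env>-infra" \
#   -l app.kubernetes.io/component=rabbitmq --no-headers | unhealthy_pods
```

Each name printed is a candidate for the `kubectl describe pod` and `kubectl logs` checks in this runbook.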

Confirm how many nodes are actually reporting:

count(rabbitmq_build_info{namespace="safetywing-<env>-infra"})
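The same query can be run from a shell through the Prometheus HTTP API. The sketch below is an assumption-laden example: `PROM_URL` is a placeholder for the Prometheus endpoint (not given in this runbook), the helper names are hypothetical, and `curl` and `jq` are assumed to be installed.

```shell
# Hypothetical: base URL of the Prometheus server; adjust to the environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Prints the number of RabbitMQ nodes currently reporting in namespace $1.
reporting_nodes() {
  ns="$1"
  curl -s --get "$PROM_URL/api/v1/query" \
    --data-urlencode "query=count(rabbitmq_build_info{namespace=\"$ns\"})" \
    | jq -r '.data.result[0].value[1] // "0"'
}

# Prints how many nodes are missing. $1 = configured replicas, $2 = reporting.
# An empty $2 is treated as 0, because count() over no series returns an
# empty result rather than 0.
missing_nodes() {
  expected="$1"
  reporting="${2:-0}"
  echo $(( expected - reporting ))
}
```

For example, `missing_nodes <replicas> "$(reporting_nodes safetywing-<env>-infra)"` prints the gap between configured and reporting nodes.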

In the management UI (via kubectl rabbitmq manage <name> -n safetywing-<env>-infra), check Overview → Nodes for nodes shown as down and any quorum-queue minority warnings.

Mitigation#

  1. Inspect events and logs of the affected pod for the cause (OOMKill, CrashLoopBackOff, scheduling/PVC issues):
    kubectl get events -n safetywing-<env>-infra --sort-by=.lastTimestamp | tail -30
    kubectl logs <rabbitmq-pod> -n safetywing-<env>-infra --previous
  2. If the pod is stuck or unhealthy, delete it; the StatefulSet controller recreates it and reattaches the same PVC:
    kubectl delete pod <rabbitmq-pod> -n safetywing-<env>-infra
  3. Verify the node rejoins the cluster via peer discovery and is listed as running in cluster_status:
    kubectl exec -n safetywing-<env>-infra <rabbitmq-pod> -- rabbitmq-diagnostics cluster_status
  4. If the node does not rejoin, check whether it is partitioned or its data is corrupt. rabbitmq-diagnostics cluster_status reports network partitions; resolve any partition per the configured partition-handling strategy before forcing anything.
  5. For quorum queues at risk, do not delete additional pods until the recovered node has caught up; check member health in the management UI before any rolling action.
  6. If a PVC or node-affinity issue blocks scheduling, fix the underlying storage/node problem rather than scaling down replicas in the RabbitmqCluster spec.
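Steps 2, 3 and 5 can be wrapped in small guards so that no further pod is deleted while quorum is at risk. This is a sketch rather than a definitive procedure: the function names and the `KUBECTL` override are hypothetical, and it relies on the `rabbitmq-diagnostics check_if_node_is_quorum_critical` health check, which exits non-zero when stopping the target node would leave a quorum queue without a majority.

```shell
# Override for testing or to use a wrapper; defaults to the real kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds (exit 0) only if stopping this node would NOT leave any quorum
# queue without a majority. $1 = namespace, $2 = pod.
node_safe_to_stop() {
  "$KUBECTL" exec -n "$1" "$2" -- \
    rabbitmq-diagnostics -q check_if_node_is_quorum_critical
}

# Waits for the recreated pod to become Ready, then confirms its RabbitMQ
# node is running. $1 = namespace, $2 = pod, $3 = timeout (default 300s).
wait_for_node() {
  "$KUBECTL" wait --for=condition=Ready "pod/$2" -n "$1" --timeout="${3:-300s}" &&
    "$KUBECTL" exec -n "$1" "$2" -- rabbitmq-diagnostics -q check_running
}
```

A typical use is to run `wait_for_node` on the recreated pod first, and to delete another pod only if `node_safe_to_stop` succeeds for it.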

References#