CephMonOutOfQuorum • SafetyWing Runbooks

Meaning

At least one Ceph monitor (mon) has dropped out of quorum. Mons maintain the cluster map and consensus; losing one reduces fault tolerance, and losing a majority halts the cluster.

Fires when:

count(ceph_mon_quorum_status == 0) > 0

The condition must hold for 10 minutes (for: 10m) before the alert fires, with severity page and tier platform. This is a cluster-wide platform alert: it carries a cluster label but no environment label.

Impact

Mons are the control plane of Ceph. With one mon out of quorum the cluster still serves IO but with a reduced redundancy margin; with three mons, for example, a second failure loses quorum. If quorum is lost entirely, all Ceph IO stops and PVCs across all environments on hetzner become unavailable.

Diagnosis

kubectl config use-context hetzner

# Quorum membership and overall health
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail

Check the mon pods and the nodes they run on:

kubectl -n rook-ceph get pods -l app=rook-ceph-mon -o wide
kubectl -n rook-ceph logs <rook-ceph-mon-...>
kubectl get nodes        # is a hetzner node NotReady?
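
To make the node check easier to scan, the output of kubectl get nodes can be filtered down to the nodes that are not Ready. The following is a minimal sketch; the helper name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes --no-headers (NAME first, STATUS second).

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name of every node whose STATUS column does not start with
# "Ready". Cordoned nodes ("Ready,SchedulingDisabled") are not printed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example wiring against the cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
# To list the mon pods on a node it reports:
#   kubectl -n rook-ceph get pods -l app=rook-ceph-mon \
#     --field-selector spec.nodeName=<node-name> -o wide
```

If the helper prints a node that also hosts a mon pod, that node is the first thing to look at.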

Identify which mon is missing from quorum (compare ceph mon stat membership against the running mon pods).
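
The comparison described above can be sketched as a small shell function. This is an illustration, not part of the original runbook: missing_mons is a hypothetical name, and the commented wiring assumes jq is installed locally and that ceph quorum_status -f json exposes monmap.mons[].name and quorum_names, as it does in current Ceph releases.

```shell
# Hypothetical helper: print each mon that is in the monmap but not in quorum.
#   $1: space-separated mon names from the monmap
#   $2: space-separated mon names currently in quorum
missing_mons() {
  local name
  for name in $1; do
    case " $2 " in
      *" $name "*) ;;                 # mon is in quorum, nothing to report
      *) printf '%s\n' "$name" ;;     # mon is known but out of quorum
    esac
  done
}

# Example wiring against the cluster (assumes jq is available):
#   status=$(kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph quorum_status -f json)
#   missing_mons \
#     "$(printf '%s' "$status" | jq -r '.monmap.mons[].name' | xargs)" \
#     "$(printf '%s' "$status" | jq -r '.quorum_names[]' | xargs)"
```

Rook names each mon deployment after the mon ID (for example rook-ceph-mon-a for mon a), so the printed name should lead directly to the affected pod.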

Mitigation

Find the affected mon from ceph mon stat / ceph status and locate its pod and node.
Node down? If the hetzner node hosting the mon is NotReady, recover the node — that usually restores the mon. With 5 nodes, Rook may also reschedule the mon once the node returns.
Pod crashlooping? Inspect its logs with kubectl -n rook-ceph logs <rook-ceph-mon-...> (add --previous to see output from the last crashed container), then restart it by deleting the pod and letting its deployment recreate it:
```
kubectl -n rook-ceph delete pod <rook-ceph-mon-...>
```
Persistent failure? Let the Rook operator handle mon failover: confirm the operator pod (rook-ceph-operator) is healthy, since it can fail over or replace a bad mon. Avoid manual monmap surgery unless you are following Rook’s mon-recovery procedure.
Re-run ceph mon stat until all mons are back in quorum and ceph status is clean.
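
The final re-check can be turned into a polling loop rather than repeated by hand. The sketch below is an assumption-laden illustration: wait_for_quorum and quorum_count are hypothetical names, the expected mon count of 3 in the usage comment is a placeholder to replace with the cluster's real mon count, and the wiring assumes jq is installed locally.

```shell
# Hypothetical helper: poll until at least <expected> mons are in quorum.
#   $1: expected number of mons in quorum
#   $2: maximum number of attempts
#   $3: seconds to sleep between attempts
#   remaining args: a command that prints the current quorum size
wait_for_quorum() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i=1 current
  while [ "$i" -le "$attempts" ]; do
    current=$("$@" 2>/dev/null) || current=0
    case "$current" in
      ''|*[!0-9]*) current=0 ;;       # treat empty or non-numeric output as 0
    esac
    if [ "$current" -ge "$expected" ]; then
      echo "quorum restored: $current/$expected mons"
      return 0
    fi
    echo "attempt $i/$attempts: $current/$expected mons in quorum" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example wiring against the cluster (assumes jq; 3 is a placeholder count):
#   quorum_count() {
#     kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
#       ceph quorum_status -f json | jq '.quorum | length'
#   }
#   wait_for_quorum 3 30 20 quorum_count
```

A zero exit status only means the quorum size has recovered; ceph status still needs a manual look to confirm that overall health is clean.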
