Meaning#
At least one Ceph monitor (mon) has dropped out of quorum. Mons maintain the cluster map and consensus; losing one reduces fault tolerance, and losing a majority halts the cluster.
Fires when:
count(ceph_mon_quorum_status == 0) > 0for: 10m, severity page, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.
Impact#
Mons are the control plane of Ceph. With one mon out of quorum the cluster still serves IO but has no redundancy margin; if quorum is lost entirely, all Ceph IO stops and PVCs across all environments on hetzner become unavailable.
Diagnosis#
kubectl config use-context hetzner
# Quorum membership and overall health
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detailCheck the mon pods and the nodes they run on:
kubectl -n rook-ceph get pods -l app=rook-ceph-mon -o wide
kubectl -n rook-ceph logs <rook-ceph-mon-...>
kubectl get nodes # is a hetzner node NotReady?Identify which mon is missing from quorum (compare ceph mon stat membership against the running mon pods).
Mitigation#
- Find the affected mon from
ceph mon stat/ceph statusand locate its pod and node. - Node down? If the hetzner node hosting the mon is
NotReady, recover the node — that usually restores the mon. With 5 nodes, Rook may also reschedule the mon once the node returns. - Pod crashlooping? Inspect
kubectl -n rook-ceph logs <rook-ceph-mon-...>; restart it by deleting the pod and let the deployment recreate it:kubectl -n rook-ceph delete pod <rook-ceph-mon-...> - Persistent failure? Let the Rook operator handle mon failover — confirm the operator pod (
rook-ceph-operator) is healthy; it can fail over/replace a bad mon. Avoid manualmonmapsurgery unless following Rook’s mon-recovery procedure. - Re-run
ceph mon statuntil all mons are back in quorum andceph statusis clean.