Meaning
One or more Ceph OSDs (object storage daemons — the per-disk processes that store data) are marked down. Each OSD maps to a physical disk on a hetzner node; a down OSD reduces redundancy and capacity.
Fires when:
count(ceph_osd_up == 0) > 0
for: 10m, severity ticket, tier platform.
This is a cluster-wide platform alert and carries no environment label, only cluster.
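To see which OSDs the expression is counting, the plain-text output of ceph osd tree can be filtered for entries whose status is down. The following is a minimal sketch, not part of the alert itself: the helper name down_osds is hypothetical, and it assumes each OSD row shows the status immediately after the osd.N name.

```shell
# Hypothetical helper: read `ceph osd tree` plain output on stdin and
# print the name of every OSD whose status field is "down".
down_osds() {
  awk '{
    # Scan each row for an "osd.N" field directly followed by "down".
    for (i = 1; i < NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
  }'
}

# Usage (needs the hetzner context and the rook-ceph-tools deployment):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree | down_osds
```

An empty result while the alert is still firing would suggest the metric is stale or the OSD has only just recovered.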
Impact
Ceph keeps serving IO from surviving replicas, so usually no outage. But redundancy is reduced and recovery/backfill load increases. Multiple OSDs down (or a full failure domain) can cause degraded/unavailable PGs and put PVCs across all environments at risk.
Diagnosis
kubectl config use-context hetzner
# Which OSDs are down, and where
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
Map the down OSD(s) to pods and nodes:
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
kubectl -n rook-ceph logs <rook-ceph-osd-...>
kubectl get nodes # node hosting the OSD NotReady?
Mitigation
- Locate the OSD via ceph osd tree: note its host and the backing disk; check that node's status.
- Node down? Recover the hetzner node; once it returns, the OSD pod and OSD should come back up/in.
- Pod crashlooping? Read kubectl -n rook-ceph logs <rook-ceph-osd-...>. Restart it: kubectl -n rook-ceph delete pod <rook-ceph-osd-...>
- Disk failure? If the underlying disk is bad, let Rook re-provision: purge the dead OSD per Rook's OSD-removal procedure and add a replacement disk. The 2 newer hetzner nodes have spare disks Rook can pick up as new OSDs.
- Let recovery finish. After the OSD is back, watch ceph status for backfill/recovery to complete and PGs to return to active+clean.
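The final step can be scripted rather than watched by hand. This is a rough sketch under assumptions: both function names are hypothetical, the 30-second poll interval is arbitrary, and the parser assumes ceph pg stat prints a summary of the form "N pgs: M active+clean; ...". PGs that are clean but scrubbing (for example active+clean+scrubbing) are deliberately not counted, so the loop may wait slightly longer than strictly necessary.

```shell
# Hypothetical helper: read `ceph pg stat` output on stdin and succeed
# only if every PG is reported as exactly "active+clean".
pgs_all_clean() {
  awk '
    NR == 1 {
      sub(/;.*/, "")                 # keep only the PG state summary
      total = $1 + 0                 # leading number is the PG count
      clean = 0
      n = split($0, parts, /[:,]/)   # parts[2..n] look like " 23 active+clean"
      for (i = 2; i <= n; i++) {
        split(parts[i], kv, " ")
        if (kv[2] == "active+clean") clean = kv[1] + 0
      }
      ok = (total > 0 && clean == total)
    }
    END { exit (ok ? 0 : 1) }
  '
}

# Hypothetical wrapper: poll until all PGs are active+clean.
wait_for_clean() {
  until kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg stat | pgs_all_clean; do
    sleep 30
  done
}
```

Run wait_for_clean from a shell that already uses the hetzner context; it returns once recovery has finished.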
Do not mark OSDs out en masse without checking capacity headroom — that can trigger large rebalances.
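As a rough illustration of the headroom check, the arithmetic below estimates raw utilisation after the chosen OSDs are marked out, assuming their data is spread evenly over the remaining OSDs. The helper name and its arguments are hypothetical; the inputs would be read off ceph df and ceph osd df. The model ignores CRUSH placement and per-OSD imbalance, so the fullest OSD may end up higher than the figure it prints.

```shell
# Hypothetical helper: projected raw utilisation (percent) after marking OSDs out.
#   $1 = raw space used across the cluster
#   $2 = raw capacity of the cluster
#   $3 = combined capacity of the OSDs to be marked out
# All three values must use the same unit (for example TiB).
projected_use_pct() {
  awk -v used="$1" -v total="$2" -v out="$3" 'BEGIN {
    remaining = total - out
    if (remaining <= 0) { print "inf"; exit 1 }   # nothing left to hold the data
    printf "%.1f\n", 100 * used / remaining
  }'
}

# Example: 60 TiB used of 100 TiB, marking out 20 TiB worth of OSDs:
#   projected_use_pct 60 100 20    # prints 75.0
```

Compare the result with the cluster's nearfull ratio (0.85 by default in Ceph) before proceeding.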