CephOSDDown • SafetyWing Runbooks

Meaning

One or more Ceph OSDs (object storage daemons, the per-disk processes that store data) are marked down. Each OSD maps to a physical disk on a Hetzner node, so a down OSD reduces both redundancy and capacity.

Fires when:

count(ceph_osd_up == 0) > 0

The rule uses for: 10m, severity ticket, and tier platform. It is a cluster-wide platform alert, so it carries a cluster label but no environment label.

Impact

Ceph keeps serving IO from the surviving replicas, so a single down OSD usually causes no outage. Redundancy is reduced, however, and recovery/backfill load increases. Multiple down OSDs (or the loss of a full failure domain) can leave PGs (placement groups) degraded or unavailable and put PVCs across all environments at risk.

Diagnosis

kubectl config use-context hetzner

# Which OSDs are down, and where
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df

Map the down OSD(s) to pods and nodes:

kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
kubectl -n rook-ceph logs <rook-ceph-osd-...>
kubectl get nodes        # node hosting the OSD NotReady?
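On a large cluster it can help to reduce the ceph osd tree output to just the IDs of the down OSDs. The following is a minimal sketch, not part of the original runbook: down_osd_ids is a hypothetical helper name, and it assumes the plain-text layout in which the status word (up or down) immediately follows the osd.N name. The ceph-osd-id pod label in the usage comment is also an assumption about how Rook labels OSD pods; verify it against your cluster before relying on it.

```sh
# Read `ceph osd tree` text on stdin and print the numeric ID of every
# OSD whose status column is "down", one ID per line.
down_osd_ids() {
  awk '{
    # Scan each line for a field like "osd.7" directly followed by "down".
    for (i = 1; i < NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
        print substr($i, 5)   # strip the leading "osd."
  }'
}

# Usage (requires cluster access, so it is left commented out):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree | down_osd_ids
#
# Assumed label; shows the pod and node for one down OSD ID:
#   kubectl -n rook-ceph get pods -l app=rook-ceph-osd,ceph-osd-id=<id> -o wide
```

The helper only filters text, so it is safe to run repeatedly while the alert is firing.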

Mitigation

Locate the OSD with ceph osd tree, note its host and backing disk, and check that node’s status.
Node down? Recover the Hetzner node; once it returns, the OSD pod should restart and the OSD should come back up/in.
Pod crashlooping? Read its logs with kubectl -n rook-ceph logs <rook-ceph-osd-...> (add --previous to see the output of the last crashed container), then restart it by deleting the pod:
```
kubectl -n rook-ceph delete pod <rook-ceph-osd-...>
```
Disk failure? If the underlying disk is bad, let Rook re-provision: purge the dead OSD following Rook’s OSD-removal procedure and add a replacement disk. The two newer Hetzner nodes have spare disks that Rook can pick up as new OSDs.
Let recovery finish. After the OSD is back, watch ceph status until backfill/recovery completes and all PGs return to active+clean.
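Waiting for recovery can be scripted rather than watched by hand. The sketch below polls ceph health through the toolbox deployment used earlier in this runbook and returns once the cluster reports HEALTH_OK. The names ceph_cmd and wait_for_health_ok, and the default attempt count and interval, are hypothetical choices for illustration. Note that HEALTH_OK is a stricter condition than all PGs being active+clean, so an unrelated warning will keep the loop waiting; confirm PG state with ceph status in that case.

```sh
# Run a ceph subcommand inside the Rook toolbox pod.
ceph_cmd() {
  kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
}

# wait_for_health_ok [max_attempts] [sleep_seconds]
# Returns 0 as soon as `ceph health` starts with HEALTH_OK,
# or 1 if that never happens within max_attempts polls.
wait_for_health_ok() {
  max_attempts=${1:-60}
  interval=${2:-30}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    health=$(ceph_cmd health 2>/dev/null)
    case "$health" in
      HEALTH_OK*)
        echo "cluster healthy: $health"
        return 0
        ;;
    esac
    echo "attempt $attempt/$max_attempts: ${health:-no response}"
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Usage (requires cluster access, so it is left commented out):
#   wait_for_health_ok 120 30 || echo "recovery still in progress"
```

A bounded loop is used so that the script eventually gives up and hands control back to the operator instead of hanging indefinitely.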

Do not mark OSDs out en masse without first checking capacity headroom; doing so can trigger large rebalances.
