CephHealthWarning • SafetyWing Runbooks

Meaning#

The Rook-Ceph cluster has been in HEALTH_WARN for a sustained period. Ceph is functional but degraded — something needs attention before it escalates to HEALTH_ERR.

Fires when:

ceph_health_status == 1

for: 30m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.

Impact#

Usually no immediate outage — IO continues. But HEALTH_WARN indicates reduced redundancy or headroom (degraded PGs, an OSD nearing full, a flapping mon, etc.) that affects storage backing PVCs across all environments. Left unaddressed it can progress to HEALTH_ERR and read-only/blocked writes.

Diagnosis#

kubectl config use-context hetzner

# What exactly is warning?
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail

# Follow the reported reason into the relevant view
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df

Check Rook pods if a component is implicated:

kubectl -n rook-ceph get pods
kubectl -n rook-ceph logs <rook-ceph-osd-...>

Mitigation#

Let ceph health detail name the warning, then act on it:

Capacity warnings (nearfull, OSD_NEARFULL): see CephClusterNearFull — add OSDs/disks (the 2 newer hetzner nodes have spare disks), reweight, or free data.
OSD down/out: see CephOSDDown.
Mon out of quorum / clock skew: see CephMonOutOfQuorum; for clock skew, check node time sync.
Degraded/undersized PGs, recovery in progress: often self-heals — confirm recovery is progressing in ceph status; if stalled, check for down OSDs.
Re-run ceph status and confirm return to HEALTH_OK.

Raising near-full ratios to silence a capacity warning is a last resort.

Meaning#

Impact#

Diagnosis#

Mitigation#

References#