Meaning#
The Rook-Ceph cluster has been in HEALTH_WARN for a sustained period. Ceph is functional but degraded — something needs attention before it escalates to HEALTH_ERR.
Fires when:
ceph_health_status == 1for: 30m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.
Impact#
Usually no immediate outage — IO continues. But HEALTH_WARN indicates reduced redundancy or headroom (degraded PGs, an OSD nearing full, a flapping mon, etc.) that affects storage backing PVCs across all environments. Left unaddressed it can progress to HEALTH_ERR and read-only/blocked writes.
Diagnosis#
kubectl config use-context hetzner
# What exactly is warning?
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
# Follow the reported reason into the relevant view
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph dfCheck Rook pods if a component is implicated:
kubectl -n rook-ceph get pods
kubectl -n rook-ceph logs <rook-ceph-osd-...>Mitigation#
Let ceph health detail name the warning, then act on it:
- Capacity warnings (
nearfull,OSD_NEARFULL): see CephClusterNearFull — add OSDs/disks (the 2 newer hetzner nodes have spare disks), reweight, or free data. - OSD down/out: see CephOSDDown.
- Mon out of quorum / clock skew: see CephMonOutOfQuorum; for clock skew, check node time sync.
- Degraded/undersized PGs, recovery in progress: often self-heals — confirm recovery is progressing in
ceph status; if stalled, check for down OSDs. - Re-run
ceph statusand confirm return toHEALTH_OK.
Raising near-full ratios to silence a capacity warning is a last resort.