CephHealthError • SafetyWing Runbooks

Meaning#

The Rook-Ceph cluster is reporting HEALTH_ERR — Ceph has detected one or more error-level conditions and storage is at risk. This is the most severe Ceph health state.

Fires when:

ceph_health_status == 2

The rule uses for: 5m, severity page, and tier platform, so it pages only after the condition has held for five minutes. This is a cluster-wide platform alert: it carries a cluster label but no environment label.
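When reading the raw metric, it can help to translate the gauge value back into a Ceph health state. The sketch below assumes the metric is exported by the Ceph manager's Prometheus module, where 0, 1, and 2 stand for HEALTH_OK, HEALTH_WARN, and HEALTH_ERR. The function name health_name is illustrative and not part of any tool.

```shell
# Map a ceph_health_status gauge value to its Ceph health state.
# Assumes the Ceph mgr Prometheus module encoding (0, 1, 2).
health_name() {
  case "$1" in
    0) echo "HEALTH_OK" ;;
    1) echo "HEALTH_WARN" ;;
    2) echo "HEALTH_ERR" ;;
    *) echo "UNKNOWN" ;;
  esac
}
```

Calling health_name 2 returns the state this alert fires on.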

Impact#

Ceph backs the PVCs for workloads across all environments on the hetzner cluster. In HEALTH_ERR, IO may stall, volumes can be forced read-only, and writes can be blocked. Treat as an active or imminent storage outage affecting every environment that depends on Ceph-backed storage.
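To get a rough picture of who is affected, one option is to count Ceph-backed PersistentVolumes per claim namespace. This is a sketch under two assumptions: that Rook's CSI driver names end in csi.ceph.com, and that namespaces are a usable proxy for environments on this cluster. The helper name ceph_pvs_by_namespace is hypothetical.

```shell
# Count Ceph-backed PersistentVolumes per claim namespace.
# Reads "NAMESPACE DRIVER" pairs on stdin, one PV per line.
# Assumption: Rook's CSI driver names end in "csi.ceph.com".
ceph_pvs_by_namespace() {
  awk '$2 ~ /csi\.ceph\.com$/ { n[$1]++ }
       END { for (ns in n) print ns, n[ns] }' | sort
}

# Possible way to feed it from the cluster (not run here):
#   kubectl get pv --no-headers \
#     -o custom-columns=NS:.spec.claimRef.namespace,DRIVER:.spec.csi.driver \
#     | ceph_pvs_by_namespace
```

The output is one line per namespace with the number of Ceph-backed volumes bound there, which gives a quick list of teams to notify.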

Diagnosis#

Use the Rook-Ceph toolbox for the authoritative cluster view:

kubectl config use-context hetzner

# Overall status and the specific error reasons
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail

# Drill into OSDs, mons, and capacity depending on the reported reasons
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df
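When ceph health detail returns many lines, a small filter can pull out just the error-level check codes. The sketch below assumes the "[ERR] CODE: summary" line format printed by recent Ceph releases; older releases format these lines differently. The function name error_checks is illustrative.

```shell
# Print the codes of error-level health checks, one per line.
# Reads `ceph health detail` text on stdin. Assumes lines of the form
# "[ERR] CODE: summary", as printed by recent Ceph releases.
error_checks() {
  awk '$1 == "[ERR]" { sub(/:$/, "", $2); print $2 }'
}

# Possible usage (not run here):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail \
#     | error_checks
```

Warning-level checks (lines starting with [WRN]) are skipped on purpose, since they do not by themselves put the cluster in HEALTH_ERR.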

Inspect the Rook-managed pods and their logs:

kubectl -n rook-ceph get pods
kubectl -n rook-ceph logs <rook-ceph-osd-...>   # for OSD-related errors
kubectl -n rook-ceph logs <rook-ceph-mon-...>   # for mon-related errors
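To spot which pods deserve a closer look, the pod listing can be filtered down to anything that is not Running or Completed. This is a minimal sketch that assumes the default kubectl get pods column order (NAME, READY, STATUS, RESTARTS, AGE); the function name unhealthy_pods is illustrative.

```shell
# List pods that are neither Running nor Completed, with their status.
# Reads `kubectl get pods --no-headers` output on stdin
# (default columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Possible usage (not run here):
#   kubectl -n rook-ceph get pods --no-headers | unhealthy_pods
```

For a pod that is crash-looping, adding --previous to kubectl logs shows the output of the last terminated container rather than the freshly restarted one.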

Cross-check what the toolbox reports against Grafana (grafana.safetywing.dev) and Prometheus (prom-ep.hetzner.safetywing.dev), for example to see when ceph_health_status changed.
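The Prometheus side of the cross-check can also be done from a terminal using the standard Prometheus HTTP API instant-query endpoint. In this sketch the host is the one named above, while the https scheme, the lack of authentication, and the PROM_URL and prom_query names are assumptions to adjust for the real setup.

```shell
# Base URL of the Prometheus endpoint. The host is the one named in this
# runbook; the https scheme and the absence of auth are assumptions.
PROM_URL="${PROM_URL:-https://prom-ep.hetzner.safetywing.dev}"

# Run an instant query against the Prometheus HTTP API and print raw JSON.
prom_query() {
  curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Possible usage (not run here): current health value per cluster
#   prom_query 'ceph_health_status'
```

Using -G with --data-urlencode lets curl handle the escaping of PromQL expressions that contain spaces or operators such as ==.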

Mitigation#

HEALTH_ERR has many possible causes, so let the output of ceph health detail drive the response. Steps and common causes:

1. Read ceph health detail first. Each error code (e.g. OSD_FULL, PG_DAMAGED, MON_DOWN, OSD_DOWN) points at the subsystem to fix.
2. OSD full / near-full (OSD_FULL, OSD_BACKFILLFULL): free or add capacity immediately; see CephClusterNearFull. The two newer hetzner nodes have spare disks that Rook can provision as OSDs.
3. OSD(s) down: see CephOSDDown. Check the node/disk, restart the OSD pod, and let Rook re-provision.
4. Mon issues: see CephMonOutOfQuorum. Check the mon pods and the nodes hosting them.
5. PG/data damage (PG_DAMAGED, inconsistent PGs): follow the Ceph health-checks docs for the specific code; a repair/scrub may be required.
6. Re-run ceph status after each action and confirm the cluster returns to HEALTH_OK.
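The mapping from health-check code to follow-up runbook described above can be sketched as a small lookup. It covers only the codes and runbook names mentioned in this page; the function name runbook_for is hypothetical, and any other code falls through to the Ceph health-checks documentation.

```shell
# Map a Ceph health-check code to the runbook named on this page.
# Only codes mentioned here are covered; anything else falls through.
runbook_for() {
  case "$1" in
    OSD_FULL|OSD_BACKFILLFULL) echo "CephClusterNearFull" ;;
    OSD_DOWN)                  echo "CephOSDDown" ;;
    MON_DOWN)                  echo "CephMonOutOfQuorum" ;;
    *)                         echo "Ceph health-checks docs" ;;
  esac
}
```

For example, runbook_for OSD_FULL points at CephClusterNearFull, while PG_DAMAGED and any unlisted code are sent to the Ceph health-checks docs.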

Raising full/nearfull ratios to clear a full-OSD error is a last resort — it buys time but does not fix the underlying capacity problem.
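Before touching any ratio, it is worth recording the current values so they can be restored afterwards. The sketch below assumes that ceph osd dump prints the thresholds as "full_ratio", "backfillfull_ratio", and "nearfull_ratio" lines followed by the value; the function name osd_ratios is illustrative.

```shell
# Print the configured full-ratio thresholds as name=value pairs.
# Reads `ceph osd dump` text on stdin.
osd_ratios() {
  awk '$1 ~ /^(full|backfillfull|nearfull)_ratio$/ { print $1 "=" $2 }'
}

# Possible usage (not run here):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd dump \
#     | osd_ratios
```

Saving this output in the incident notes makes it straightforward to revert a temporary increase once capacity has been added.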
