Meaning
The Rook-Ceph cluster is reporting HEALTH_ERR — Ceph has detected one or more error-level conditions and storage is at risk. This is the most severe Ceph health state.
Fires when:
ceph_health_status == 2
for: 5m, severity page, tier platform. This is a cluster-wide platform alert and carries no environment label, only cluster.
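As a quick sanity check on the alert expression, the sketch below decodes a ceph_health_status sample and shows one way to fetch the current value. It assumes the metric comes from the Ceph mgr Prometheus module (0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR) and that the Prometheus endpoint named later in this runbook is reachable over HTTPS without authentication. The function names are illustrative, not part of any existing tooling.

```shell
# Map a ceph_health_status sample value to the Ceph health state it encodes.
ceph_health_label() {
  case "$1" in
    0) echo "HEALTH_OK" ;;
    1) echo "HEALTH_WARN" ;;
    2) echo "HEALTH_ERR" ;;
    *) echo "UNKNOWN"; return 1 ;;
  esac
}

# Fetch the current sample through the Prometheus HTTP API (instant query).
# Assumption: https scheme and no auth on prom-ep.hetzner.safetywing.dev.
prom_ceph_health() {
  curl -sG "https://prom-ep.hetzner.safetywing.dev/api/v1/query" \
    --data-urlencode 'query=ceph_health_status'
}
```

For example, ceph_health_label 2 prints HEALTH_ERR, the state this alert pages on.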
Impact
Ceph backs the PVCs for workloads across all environments on the hetzner cluster. In HEALTH_ERR, IO may stall, volumes can be forced read-only, and writes can be blocked. Treat as an active or imminent storage outage affecting every environment that depends on Ceph-backed storage.
Diagnosis
Use the Rook-Ceph toolbox for the authoritative cluster view:
kubectl config use-context hetzner
# Overall status and the specific error reasons
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
# Drill into OSDs, mons, and capacity depending on the reported reasons
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df
Inspect the Rook-managed pods and their logs:
kubectl -n rook-ceph get pods
kubectl -n rook-ceph logs <rook-ceph-osd-...> # for OSD-related errors
kubectl -n rook-ceph logs <rook-ceph-mon-...> # for mon-related errors
Cross-check Grafana (grafana.safetywing.dev) and Prometheus (prom-ep.hetzner.safetywing.dev).
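The toolbox commands in this section all share the same kubectl prefix, so a small wrapper can cut down on typing during an incident. The sketch below is one possible shape: ceph_tb and ceph_snapshot are hypothetical helper names, and the KUBECTL variable is an assumed override (defaulting to kubectl) that mainly exists so the helpers can be dry-run.

```shell
# Run any ceph subcommand inside the Rook toolbox deployment.
# KUBECTL is a hypothetical override; it defaults to plain kubectl.
ceph_tb() {
  "${KUBECTL:-kubectl}" -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
}

# Capture the diagnosis commands from this section into one file
# so the state at alert time is preserved for the incident record.
ceph_snapshot() {
  out="${1:-ceph-health-$(date -u +%Y%m%dT%H%M%SZ).txt}"
  for cmd in "status" "health detail" "osd tree" "osd df" "mon stat" "df"; do
    echo "### ceph $cmd"
    # Intentionally unquoted so "health detail" splits into two arguments.
    ceph_tb $cmd
    echo
  done > "$out"
  echo "$out"
}
```

After kubectl config use-context hetzner, running ceph_tb health detail is equivalent to the full command above, and ceph_snapshot writes all six outputs to a timestamped file and prints its name.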
Mitigation
HEALTH_ERR has many possible causes — let ceph health detail drive the response. Common ones:
- Read ceph health detail first. Each error code (e.g. OSD_FULL, PG_DAMAGED, MON_DOWN, OSD_DOWN) points at the subsystem to fix.
- OSD full / near-full (OSD_FULL, OSD_BACKFILLFULL): free or add capacity immediately; see CephClusterNearFull. The 2 newer hetzner nodes have spare disks that Rook can provision as OSDs.
- OSD(s) down: see CephOSDDown. Check the node/disk, restart the OSD pod, let Rook re-provision.
- Mon issues: see CephMonOutOfQuorum. Check mon pods and the nodes hosting them.
- PG/data damage (PG_DAMAGED, inconsistent PGs): follow the Ceph health-checks docs for the specific code; a repair/scrub may be required.
- Re-run ceph status after each action and confirm the cluster returns to HEALTH_OK.
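The triage in the list above amounts to a lookup from health-check code to runbook. The sketch below illustrates that lookup. It covers only the codes and runbook names mentioned in this document, and both function names are hypothetical. The parser simply pulls upper-case CODE_LIKE tokens out of the text, which is a heuristic rather than a documented output contract.

```shell
# Map a Ceph health-check code to the runbook named in this document.
# Anything not listed falls through to the upstream health-checks docs.
runbook_for_code() {
  case "$1" in
    OSD_FULL|OSD_BACKFILLFULL) echo "CephClusterNearFull" ;;
    OSD_DOWN)                  echo "CephOSDDown" ;;
    MON_DOWN)                  echo "CephMonOutOfQuorum" ;;
    *)                         echo "Ceph health-checks docs" ;;
  esac
}

# Read `ceph health detail` text on stdin and print "CODE -> runbook",
# once per distinct code. The HEALTH_* summary token is skipped.
triage_health_detail() {
  LC_ALL=C grep -oE '[A-Z][A-Z0-9]*(_[A-Z0-9]+)+' \
    | grep -v '^HEALTH_' \
    | sort -u \
    | while read -r code; do
        printf '%s -> %s\n' "$code" "$(runbook_for_code "$code")"
      done
}
```

A possible invocation is to pipe the toolbox output straight in: kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail | triage_health_detail.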
Raising full/nearfull ratios to clear a full-OSD error is a last resort — it buys time but does not fix the underlying capacity problem.
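If the last resort is unavoidable, the sketch below shows one cautious way to inspect and raise the full ratio through the toolbox. It relies on the standard ceph osd dump and ceph osd set-full-ratio commands; the helper names and the KUBECTL override are hypothetical, and the 0.97 used in the usage note is only an illustrative value, not a recommendation from this runbook.

```shell
# Run a ceph subcommand in the Rook toolbox (KUBECTL is a hypothetical override).
ceph_tb() {
  "${KUBECTL:-kubectl}" -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
}

# Show the ratios currently in force.
# Ceph defaults: nearfull 0.85, backfillfull 0.90, full 0.95.
show_full_ratios() {
  ceph_tb osd dump | grep -E '^(full|backfillfull|nearfull)_ratio'
}

# Raise the full ratio temporarily.
# Rejects anything that is not a decimal fraction strictly between 0 and 1.
set_full_ratio() {
  if ! awk -v r="$1" 'BEGIN { exit !(r ~ /^0\.[0-9]+$/ && r + 0 > 0) }'; then
    echo "refusing ratio '$1': expected a value such as 0.97" >&2
    return 1
  fi
  ceph_tb osd set-full-ratio "$1"
}
```

A typical sequence would be show_full_ratios, then set_full_ratio 0.97 to unblock writes while capacity is freed or added, then set_full_ratio 0.95 to restore the default once the cluster has headroom again.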