CephClusterNearFull • SafetyWing Runbooks

Meaning

Raw capacity usage of the Rook-Ceph cluster has crossed the configured threshold. As OSDs fill up, Ceph first raises health warnings and stops backfilling onto the fullest OSDs, then refuses writes, so this alert is a capacity early warning that needs action before it becomes an outage.

Fires when:

ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > <ratio>

The condition must hold for 15 minutes (for: 15m) before the alert fires. It is labeled severity ticket and tier platform. Because this is a cluster-wide platform alert, it carries a cluster label but no environment label.
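To sanity-check the alert by hand, the same division can be done on the two byte counts. This is a minimal sketch: the function names usage_ratio and over_threshold are made up for illustration, and because the runbook leaves <ratio> unspecified, the threshold is passed in as an argument.

```shell
# usage_ratio USED TOTAL: print used/total to four decimal places.
usage_ratio() {
  awk -v used="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "total must be > 0" > "/dev/stderr"; exit 2 }
    printf "%.4f\n", used / total
  }'
}

# over_threshold USED TOTAL THRESHOLD: exit 0 if used/total is strictly
# above THRESHOLD (the same comparison the alert expression makes).
over_threshold() {
  awk -v used="$1" -v total="$2" -v thr="$3" 'BEGIN {
    if (total <= 0) exit 2
    exit !((used / total) > thr)
  }'
}
```

Feed it the current values of ceph_cluster_total_used_bytes and ceph_cluster_total_bytes, plus whatever threshold the alert rule actually uses.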

Impact

As OSDs cross the nearfull and then the backfillfull ratio, the cluster goes to HEALTH_WARN and Ceph stops backfilling onto the affected OSDs. At the full ratio it goes to HEALTH_ERR, blocks writes, and can force volumes read-only. Because Ceph backs PVCs across all environments on hetzner, a full cluster is a multi-environment storage outage.
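The three ratios form a simple ladder, and Ceph evaluates them per OSD, so one very full OSD is enough to trip a state. The sketch below maps a usage ratio to the state it would trigger. The function name fullness_state is illustrative, and the defaults are Ceph's upstream defaults (0.85 / 0.90 / 0.95), which are an assumption here: this cluster may be configured differently.

```shell
# fullness_state RATIO [NEARFULL] [BACKFILLFULL] [FULL]
# Print the state a given OSD usage ratio would trigger.
# Defaults are upstream Ceph defaults, not values confirmed for this cluster.
# The configured values can be read with:
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd dump | grep ratio
fullness_state() {
  awk -v r="$1" -v near="${2:-0.85}" -v back="${3:-0.90}" -v full="${4:-0.95}" 'BEGIN {
    if (r >= full)      print "full"          # HEALTH_ERR, writes blocked
    else if (r >= back) print "backfillfull"  # HEALTH_WARN, no backfill onto this OSD
    else if (r >= near) print "nearfull"      # HEALTH_WARN
    else                print "ok"
  }'
}
```

For example, `fullness_state 0.92` prints `backfillfull` with the default ratios.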

Diagnosis

kubectl config use-context hetzner

# Cluster and per-pool capacity
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Per-OSD usage — find the fullest OSDs / imbalance
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
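
To put a number on imbalance, a small helper can summarize per-OSD utilization. This is a sketch, not part of the existing tooling: osd_spread is a hypothetical name, and it expects "<osd-name> <utilization-percent>" pairs on stdin. The jq pipeline in the comment assumes the JSON output of ceph osd df exposes a nodes array with name and utilization fields; verify that against the Ceph version in use.

```shell
# osd_spread: read "<osd-name> <utilization-percent>" lines on stdin and
# print the fullest OSD, the emptiest OSD, and the gap between them.
#
# Possible input (field names are an assumption, check your Ceph version):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df --format json \
#     | jq -r '.nodes[] | "\(.name) \(.utilization)"' | osd_spread
osd_spread() {
  awk '
    { v = $2 + 0 }                                   # force numeric comparison
    NR == 1 || v > max { max = v; max_name = $1 }
    NR == 1 || v < min { min = v; min_name = $1 }
    END {
      if (NR == 0) { print "no OSD rows on stdin"; exit 1 }
      printf "fullest=%s %.1f%% emptiest=%s %.1f%% spread=%.1f\n",
             max_name, max, min_name, min, max - min
    }'
}
```

A large spread suggests rebalancing will help; a small spread with high utilization everywhere means the cluster simply needs more capacity.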

List the Rook OSD pods (one per disk) to see how many OSDs are running and on which nodes:

kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide

Confirm the trend in Grafana (grafana.safetywing.dev) to gauge time-to-full.
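
Once a daily growth rate has been read off the Grafana trend, time-to-full is a one-line extrapolation. The sketch below assumes growth is roughly linear, which is only an approximation. The function name days_to_ratio and its argument order are illustrative.

```shell
# days_to_ratio USED TOTAL GROWTH_PER_DAY TARGET_RATIO
# Days until used/total reaches TARGET_RATIO, assuming linear growth.
# USED, TOTAL and GROWTH_PER_DAY must share one unit (bytes, GiB, ...).
days_to_ratio() {
  awk -v used="$1" -v total="$2" -v growth="$3" -v target="$4" 'BEGIN {
    if (growth <= 0) { print "inf"; exit }       # flat or shrinking usage
    remaining = target * total - used
    if (remaining <= 0) { print "0.0"; exit }    # already at or past the target
    printf "%.1f\n", remaining / growth
  }'
}
```

Run it once per ratio of interest (the alert threshold, then the nearfull and full ratios) to see how much time each stage leaves.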

Mitigation

Add capacity or reduce usage — in roughly this order:

1. Add OSDs / disks. The 2 newer hetzner nodes have spare disks; have Rook provision them as new OSDs (via the CephCluster storage spec). This is the durable fix: more raw capacity lowers the usage ratio.
2. Rebalance if usage is skewed across OSDs (ceph osd df shows imbalance). Reweight the fullest OSDs with ceph osd reweight-by-utilization, or enable the balancer module, so data spreads evenly instead of one OSD tripping the full ratio.
3. Free data. Delete unused PVCs, snapshots, orphaned RBD images, and stale pool data across environments.
4. Last resort: temporarily raise the nearfull/backfillfull/full ratios (ceph osd set-nearfull-ratio / set-backfillfull-ratio / set-full-ratio) to keep writes flowing while real capacity is added. This only buys time and increases outage risk; never leave it as the fix.
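
Before reweighting OSDs or touching any ratio, it is worth previewing what would change. The sketch below wraps the toolbox invocation used in Diagnosis and runs three read-only commands. The helper names ceph_tools and rebalance_preview and the DRY_RUN switch are hypothetical conveniences, not existing tooling.

```shell
# ceph_tools ARGS...: run a ceph command in the Rook toolbox.
# With DRY_RUN=1 the command is only printed, not executed.
ceph_tools() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph $*"
  else
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph "$@"
  fi
}

# rebalance_preview: read-only checks to run before rebalancing or
# raising ratios. None of these commands change cluster state.
rebalance_preview() {
  ceph_tools osd test-reweight-by-utilization  # reports proposed reweights, applies nothing
  ceph_tools balancer status                   # is the balancer on, and in which mode?
  ceph_tools osd dump                          # output includes the current *_ratio settings
}
```

Running `DRY_RUN=1 rebalance_preview` first prints the exact kubectl commands so they can be reviewed before anything touches the cluster.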

Re-run ceph df after changes and confirm the ratio drops back below threshold.
