Meaning
Raw capacity usage of the Rook-Ceph cluster has crossed the configured threshold. As Ceph approaches full it first throttles, then refuses writes — so this alert is a capacity early-warning that needs action before it becomes an outage.
Fires when:
ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > <ratio>
The rule uses for: 15m, severity ticket, and tier platform. This is a cluster-wide platform alert, so it carries no environment label, only cluster.
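To check the condition by hand, the expression reduces to a single division of the two metric values. The sketch below is a minimal shell helper, not the alert rule itself. The function names `ceph_usage_ratio` and `ceph_over_threshold` are hypothetical, and the byte counts and the threshold are supplied by the caller (for example, the current values of the two metrics and whatever ratio the rule is configured with).

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of Rook or Ceph.

# Print used/total as a ratio with four decimals.
# Usage: ceph_usage_ratio USED_BYTES TOTAL_BYTES
ceph_usage_ratio() {
  awk -v used="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "total must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.4f\n", used / total
  }'
}

# Exit 0 if used/total is above the given threshold ratio, 1 otherwise.
# Usage: ceph_over_threshold USED_BYTES TOTAL_BYTES THRESHOLD_RATIO
ceph_over_threshold() {
  awk -v used="$1" -v total="$2" -v limit="$3" 'BEGIN {
    exit !(total > 0 && used / total > limit)
  }'
}
```

With made-up numbers, `ceph_usage_ratio 750 1000` prints `0.7500`, and `ceph_over_threshold 750 1000 0.7` exits 0. The alert additionally requires the condition to hold for 15m, which a one-off check like this does not capture.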
Impact
When Ceph hits its nearfull/backfillfull/full ratios it degrades to HEALTH_WARN then HEALTH_ERR, and at the full ratio it blocks writes and can force volumes read-only. Because Ceph backs PVCs across all environments on hetzner, a full cluster is a multi-environment storage outage.
Diagnosis
kubectl config use-context hetzner
# Cluster and per-pool capacity
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
# Per-OSD usage — find the fullest OSDs / imbalance
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
Check Rook OSD pods to see which disks are currently in play:
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide
Confirm the trend in Grafana (grafana.safetywing.dev) to gauge time-to-full.
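Time-to-full can be roughed out from two usage samples read off the Grafana trend. The sketch below assumes linear growth between the samples, which is a simplification. The function name `days_to_full` and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: linear estimate of days until raw capacity is exhausted.
# Usage: days_to_full USED_EARLIER USED_NOW TOTAL DAYS_BETWEEN_SAMPLES
# All sizes must be in the same unit (bytes, GiB, ...).
days_to_full() {
  awk -v prev="$1" -v cur="$2" -v total="$3" -v span="$4" 'BEGIN {
    if (span <= 0) { print "sample span must be > 0" > "/dev/stderr"; exit 1 }
    growth = (cur - prev) / span            # usage growth per day
    if (growth <= 0) { print "not growing"; exit 0 }
    printf "%.1f\n", (total - cur) / growth # days until used == total
  }'
}
```

For example, `days_to_full 600 700 1000 10` prints `30.0`. Treat the result as an upper bound: it measures time until usage reaches total raw capacity, while Ceph blocks writes earlier, at its full ratio.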
Mitigation
Add capacity or reduce usage — in roughly this order:
- Add OSDs / disks. The 2 newer hetzner nodes have spare disks; have Rook provision them as new OSDs (via the CephCluster storage spec). This is the durable fix — more raw capacity lowers the usage ratio.
- Rebalance if usage is skewed across OSDs (ceph osd df shows imbalance): reweight the fullest OSDs (ceph osd reweight-by-utilization or the balancer module) so data spreads evenly instead of one OSD tripping the full ratio.
- Free data. Delete unused PVCs/snapshots/orphaned RBD images and stale pool data across environments.
- Last resort: temporarily raise the nearfull/backfillfull/full ratios (ceph osd set-nearfull-ratio / set-backfillfull-ratio / set-full-ratio) to keep writes flowing while real capacity is added. This only buys time and increases outage risk, so never leave it as the fix.
Re-run ceph df after changes and confirm the ratio drops back below threshold.
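To judge whether the rebalance step is worth doing, it helps to put a number on the skew between OSDs. The sketch below is one possible way to do that. The function name `osd_spread` and its input format (one `name utilization_percent` pair per line) are assumptions made for illustration. The pairs could be extracted from `ceph osd df --format json`, for instance with a locally installed jq and a filter like `.nodes[] | "\(.name) \(.utilization)"`; check those field names against your Ceph version.

```shell
#!/bin/sh
# Hypothetical helper: read "osd_name utilization_percent" pairs on stdin and
# report the fullest and emptiest OSD plus the spread between them.
osd_spread() {
  awk 'NF >= 2 {
    v = $2 + 0                                   # force numeric comparison
    if (n == 0 || v > max) { max = v; max_osd = $1 }
    if (n == 0 || v < min) { min = v; min_osd = $1 }
    n++
  }
  END {
    if (n == 0) { print "no OSD rows read"; exit 1 }
    printf "max %s %.1f min %s %.1f spread %.1f\n", max_osd, max, min_osd, min, max - min
  }'
}
```

Fed the three made-up rows `osd.0 81.5`, `osd.1 62.0`, and `osd.2 70.25`, it prints `max osd.0 81.5 min osd.1 62.0 spread 19.5`. A large spread suggests rebalancing will help. A small spread with high overall usage points back to adding capacity or freeing data.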