Ceph on SafetyWing Runbooks

CephClusterNearFull


Meaning

Raw capacity usage of the Rook-Ceph cluster has crossed the configured threshold. As Ceph approaches full it first raises a nearfull warning, then stops backfill at the backfillfull ratio, and finally blocks writes at the full ratio, so this alert is a capacity early warning that needs action before it becomes an outage.

Fires when:

ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > <ratio>

for: 15m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.
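To make the rule concrete, here is a minimal Python sketch of the same check. The `threshold` argument stands in for the rule's `<ratio>`, which this runbook does not state; the byte counts and the 0.75 value in the usage section are purely illustrative assumptions.

```python
def used_ratio(used_bytes: int, total_bytes: int) -> float:
    """Mirror of ceph_cluster_total_used_bytes / ceph_cluster_total_bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def near_full(used_bytes: int, total_bytes: int, threshold: float) -> bool:
    """True when raw usage is strictly above the threshold, as in the rule."""
    return used_ratio(used_bytes, total_bytes) > threshold


if __name__ == "__main__":
    TIB = 1024 ** 4
    # Hypothetical cluster: 80 TiB used of 100 TiB raw.
    # 0.75 is an assumed threshold, not the configured one.
    print(near_full(80 * TIB, 100 * TIB, threshold=0.75))
```

Note that the comparison is strict, so a cluster sitting exactly at the threshold does not match the expression.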

Impact

When Ceph hits its nearfull/backfillfull/full ratios it degrades to HEALTH_WARN then HEALTH_ERR, and at the full ratio it blocks writes and can force volumes read-only. Because Ceph backs PVCs across all environments on hetzner, a full cluster is a multi-environment storage outage.
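The escalation described above can be sketched as a lookup against the three ratios. The default values below (0.85, 0.90, 0.95) are Ceph's upstream defaults for nearfull, backfillfull, and full; the values configured on this cluster are not given here and may differ. Ceph also evaluates these ratios per OSD rather than against one cluster-wide number, so treat this as a simplification.

```python
def fullness_state(ratio: float,
                   nearfull: float = 0.85,
                   backfillfull: float = 0.90,
                   full: float = 0.95) -> tuple:
    """Return (capacity level, resulting health state) for a usage ratio.

    Thresholds default to upstream Ceph defaults (an assumption for this
    cluster). Simplified: real Ceph checks each OSD individually.
    """
    if not nearfull <= backfillfull <= full:
        raise ValueError("expected nearfull <= backfillfull <= full")
    if ratio >= full:
        # Writes are blocked at this point.
        return ("full", "HEALTH_ERR")
    if ratio >= backfillfull:
        # Backfill onto the affected OSDs stops.
        return ("backfillfull", "HEALTH_WARN")
    if ratio >= nearfull:
        return ("nearfull", "HEALTH_WARN")
    # No capacity-related health check is raised.
    return ("ok", "HEALTH_OK")


if __name__ == "__main__":
    for sample in (0.50, 0.87, 0.92, 0.97):
        print(sample, fullness_state(sample))
```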

CephHealthError


Meaning

The Rook-Ceph cluster is reporting HEALTH_ERR — Ceph has detected one or more error-level conditions and storage is at risk. This is the most severe Ceph health state.

Fires when:

ceph_health_status == 2

for: 5m, severity page, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.
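For reference, the numeric values of `ceph_health_status` map onto the Ceph health states as sketched below (0 is HEALTH_OK, 1 is HEALTH_WARN, 2 is HEALTH_ERR). The helper name `is_health_error` is hypothetical and only illustrates the comparison the rule makes.

```python
from enum import IntEnum


class CephHealth(IntEnum):
    """Values reported by the ceph_health_status metric."""
    HEALTH_OK = 0
    HEALTH_WARN = 1
    HEALTH_ERR = 2


def is_health_error(metric_value: float) -> bool:
    """Equivalent of the expression `ceph_health_status == 2`."""
    return int(metric_value) == CephHealth.HEALTH_ERR


if __name__ == "__main__":
    for value in (0, 1, 2):
        print(CephHealth(value).name, is_health_error(value))
```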

Impact

Ceph backs the PVCs for workloads across all environments on the hetzner cluster. In HEALTH_ERR, IO may stall, volumes can be forced read-only, and writes can be blocked. Treat as an active or imminent storage outage affecting every environment that depends on Ceph-backed storage.

CephHealthWarning


Meaning

The Rook-Ceph cluster has been in HEALTH_WARN for a sustained period. Ceph is functional but degraded — something needs attention before it escalates to HEALTH_ERR.

Fires when:

ceph_health_status == 1

for: 30m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.
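The `for: 30m` clause means a brief HEALTH_WARN does not raise a ticket: the condition has to hold at every evaluation for the whole window. A rough Python sketch of that behaviour follows; the function name, the sample format, and the 60-second evaluation interval are assumptions for illustration, not the actual Prometheus implementation.

```python
def fires(samples, hold_seconds, predicate):
    """Return True if `predicate` held continuously for `hold_seconds`.

    `samples` is an iterable of (unix_timestamp, value) pairs in time order,
    one per rule evaluation. Any sample failing the predicate resets the
    pending timer, which is how a `for:` clause behaves.
    """
    pending_since = None
    for ts, value in samples:
        if predicate(value):
            if pending_since is None:
                pending_since = ts  # alert enters the pending state
            if ts - pending_since >= hold_seconds:
                return True  # held long enough: alert fires
        else:
            pending_since = None  # condition cleared: start over
    return False


if __name__ == "__main__":
    # Hypothetical: HEALTH_WARN (1) sampled every 60 s for 30 minutes.
    steady = [(t, 1) for t in range(0, 1801, 60)]
    print(fires(steady, 30 * 60, lambda v: v == 1))
```

A single healthy sample inside the window resets the timer, so a flapping warning may never reach the 30-minute mark even though the cluster is unhealthy most of the time.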

Impact

Usually no immediate outage — IO continues. But HEALTH_WARN indicates reduced redundancy or headroom (degraded PGs, an OSD nearing full, a flapping mon, etc.) that affects storage backing PVCs across all environments. Left unaddressed it can progress to HEALTH_ERR and read-only/blocked writes.

CephMonOutOfQuorum


Meaning

At least one Ceph monitor (mon) has dropped out of quorum. Mons maintain the cluster map and consensus; losing one reduces fault tolerance, and losing quorum (fewer than a majority of mons still in agreement) halts the cluster.

Fires when:

count(ceph_mon_quorum_status == 0) > 0

for: 10m, severity page, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.
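The expression counts the series whose value is 0, one series per mon. A minimal Python sketch of the same logic, useful for seeing which mons are affected rather than just how many; the mon names and the dictionary shape are hypothetical stand-ins for the metric's per-daemon series.

```python
def out_of_quorum(quorum_status):
    """Return the sorted names of mons reporting ceph_mon_quorum_status == 0.

    `quorum_status` maps a mon name to its metric value (1 = in quorum,
    0 = out of quorum).
    """
    return sorted(name for name, status in quorum_status.items() if status == 0)


def alert_condition(quorum_status):
    """Equivalent of `count(ceph_mon_quorum_status == 0) > 0`."""
    return len(out_of_quorum(quorum_status)) > 0


if __name__ == "__main__":
    # Hypothetical three-mon cluster with mon.c out of quorum.
    sample = {"mon.a": 1, "mon.b": 1, "mon.c": 0}
    print(out_of_quorum(sample), alert_condition(sample))
```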

Impact

Mons are the control plane of Ceph. With one mon out of quorum the cluster still serves IO but with reduced fault tolerance: in a three-mon deployment, one more mon failure loses quorum. If quorum is lost entirely, all Ceph IO stops and PVCs across all environments on hetzner become unavailable.
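The quorum arithmetic behind this is simple majority voting, sketched below. The mon counts used in the example are assumptions; this runbook does not state how many mons the cluster runs.

```python
def quorum_size(total_mons: int) -> int:
    """Smallest number of mons that forms a strict majority."""
    if total_mons < 1:
        raise ValueError("total_mons must be at least 1")
    return total_mons // 2 + 1


def has_quorum(total_mons: int, mons_up: int) -> bool:
    """True while a strict majority of mons is still up."""
    return mons_up >= quorum_size(total_mons)


def further_failures_tolerated(total_mons: int, mons_up: int) -> int:
    """How many more mons can fail before quorum is lost (0 if none)."""
    return max(0, mons_up - quorum_size(total_mons))


if __name__ == "__main__":
    # Hypothetical three-mon cluster with one mon already out.
    print(has_quorum(3, 2), further_failures_tolerated(3, 2))
```

With three mons and one already out, quorum still holds but the tolerance for further failures is zero, which is why this alert pages rather than tickets.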

CephOSDDown


Meaning

One or more Ceph OSDs (object storage daemons — the per-disk processes that store data) are marked down. Each OSD maps to a physical disk on a hetzner node; a down OSD reduces redundancy and capacity.

Fires when:

count(ceph_osd_up == 0) > 0

for: 10m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.

Impact

Ceph keeps serving IO from surviving replicas, so usually no outage. But redundancy is reduced and recovery/backfill load increases. Multiple OSDs down (or a full failure domain) can cause degraded/unavailable PGs and put PVCs across all environments at risk.
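Whether a down OSD is a redundancy problem or an availability problem depends on how many replicas of each placement group survive. The sketch below assumes a replicated pool with `size=3` and `min_size=2`, which are common defaults but are not confirmed for this cluster. The returned labels are informal summaries, not Ceph's exact PG state strings.

```python
def pg_condition(replicas_up: int, size: int = 3, min_size: int = 2) -> str:
    """Classify a replicated PG by how many of its replicas are up.

    size     -- desired replica count (assumed 3)
    min_size -- replicas required for the PG to keep serving IO (assumed 2)
    """
    if not 1 <= min_size <= size:
        raise ValueError("expected 1 <= min_size <= size")
    if not 0 <= replicas_up <= size:
        raise ValueError("replicas_up must be between 0 and size")
    if replicas_up == size:
        return "healthy"      # full redundancy
    if replicas_up >= min_size:
        return "degraded"     # IO continues, redundancy reduced
    return "unavailable"      # below min_size: IO to this PG stops


if __name__ == "__main__":
    for up in (3, 2, 1):
        print(up, pg_condition(up))
```

Under these assumptions a single down OSD leaves its PGs degraded but serving IO, while losing two replicas of the same PG makes it unavailable until recovery.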