<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ceph on SafetyWing Runbooks</title><link>https://runbooks.safetywing.dev/runbooks/ceph/</link><description>Recent content in Ceph on SafetyWing Runbooks</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://runbooks.safetywing.dev/runbooks/ceph/index.xml" rel="self" type="application/rss+xml"/><item><title>CephClusterNearFull</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephclusternearfull/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephclusternearfull/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Raw capacity usage of the Rook-Ceph cluster has crossed the configured threshold. As a Ceph cluster fills up it first raises a &lt;code&gt;nearfull&lt;/code&gt; warning, then stops backfill, and finally refuses writes, so this alert is a capacity early warning that needs action before it becomes an outage.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ceph_cluster_total_used_bytes &lt;span style="color:#f92672"&gt;/&lt;/span&gt; ceph_cluster_total_bytes &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt;ratio&lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;When Ceph hits its &lt;code&gt;nearfull&lt;/code&gt;/&lt;code&gt;backfillfull&lt;/code&gt;/&lt;code&gt;full&lt;/code&gt; ratios it degrades to &lt;code&gt;HEALTH_WARN&lt;/code&gt; then &lt;code&gt;HEALTH_ERR&lt;/code&gt;, and at the full ratio it &lt;strong&gt;blocks writes&lt;/strong&gt; and can force volumes read-only. Because Ceph backs PVCs across &lt;strong&gt;all environments&lt;/strong&gt; on hetzner, a full cluster is a multi-environment storage outage.&lt;/p&gt;</description></item><item><title>CephHealthError</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephhealtherror/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephhealtherror/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Rook-Ceph cluster is reporting &lt;code&gt;HEALTH_ERR&lt;/code&gt;: Ceph has detected one or more error-level conditions and storage is at risk. This is the most severe Ceph health state.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ceph_health_status &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph backs the PVCs for workloads across &lt;strong&gt;all environments&lt;/strong&gt; on the hetzner cluster. In &lt;code&gt;HEALTH_ERR&lt;/code&gt;, IO may stall, volumes can be forced read-only, and writes can be blocked. Treat as an active or imminent storage outage affecting every environment that depends on Ceph-backed storage.&lt;/p&gt;</description></item><item><title>CephHealthWarning</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephhealthwarning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephhealthwarning/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Rook-Ceph cluster has been in &lt;code&gt;HEALTH_WARN&lt;/code&gt; for a sustained period. Ceph is functional but degraded, and something needs attention before it escalates to &lt;code&gt;HEALTH_ERR&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ceph_health_status &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 30m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There is usually no immediate outage, and IO continues. However, &lt;code&gt;HEALTH_WARN&lt;/code&gt; indicates reduced redundancy or headroom (for example degraded PGs, an OSD nearing full, or a flapping mon) affecting the storage that backs PVCs across &lt;strong&gt;all environments&lt;/strong&gt;. Left unaddressed, it can progress to &lt;code&gt;HEALTH_ERR&lt;/code&gt;, with volumes forced read-only or writes blocked.&lt;/p&gt;</description></item><item><title>CephMonOutOfQuorum</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephmonoutofquorum/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephmonoutofquorum/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;At least one Ceph monitor (mon) has dropped out of quorum. Mons maintain the cluster map and consensus; losing one reduces fault tolerance, and losing a majority halts the cluster.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;count&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;ceph_mon_quorum_status &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Mons are the control plane of Ceph. With one mon out of quorum the cluster still serves IO but has no redundancy margin; if quorum is lost entirely, all Ceph IO stops and PVCs across &lt;strong&gt;all environments&lt;/strong&gt; on hetzner become unavailable.&lt;/p&gt;</description></item><item><title>CephOSDDown</title><link>https://runbooks.safetywing.dev/runbooks/ceph/cephosddown/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/ceph/cephosddown/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;One or more Ceph OSDs (object storage daemons — the per-disk processes that store data) are marked &lt;code&gt;down&lt;/code&gt;. Each OSD maps to a physical disk on a hetzner node; a down OSD reduces redundancy and capacity.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;count&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;ceph_osd_up &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 10m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label — only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ceph keeps serving IO from surviving replicas, so there is usually no outage. Redundancy is reduced, however, and recovery/backfill load increases. Multiple OSDs down (or the loss of a full failure domain) can cause degraded or unavailable PGs and put PVCs across &lt;strong&gt;all environments&lt;/strong&gt; at risk.&lt;/p&gt;</description></item></channel></rss>