<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Elasticsearch on SafetyWing Runbooks</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/</link><description>Recent content in Elasticsearch on SafetyWing Runbooks</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://runbooks.safetywing.dev/runbooks/elasticsearch/index.xml" rel="self" type="application/rss+xml"/><item><title>ElasticsearchClusterRed</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusterred/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusterred/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Elasticsearch cluster health is RED: at least one &lt;strong&gt;primary&lt;/strong&gt; shard is unassigned, so part of the index data is unavailable and writes to affected indices fail. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_cluster_health_status{color&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;red&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 5m&lt;/code&gt;, severity &lt;code&gt;page&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
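&lt;p&gt;To check the current state by hand, a minimal sketch: the alert expression suggests the exporter emits one series per &lt;code&gt;color&lt;/code&gt; value, set to 1 for the active state, so dropping the label matcher should return whichever color each cluster currently reports. That one-series-per-color behavior is an assumption inferred from the expression above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# Assumes one series per color, with value 1 for the active color.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_cluster_health_status == 1&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;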
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Search and indexing for the affected indices are &lt;strong&gt;down&lt;/strong&gt;. This includes logging/observability data streams and application search indices (e.g. &lt;code&gt;sw_user&lt;/code&gt;, &lt;code&gt;sw_company&lt;/code&gt;, &lt;code&gt;sw_company_member&lt;/code&gt;). Any service that reads or writes those indices will see errors or empty results until the primaries are reassigned.&lt;/p&gt;</description></item><item><title>ElasticsearchClusterYellow</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusteryellow/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchclusteryellow/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Elasticsearch cluster health is YELLOW: all primary shards are assigned, but one or more &lt;strong&gt;replica&lt;/strong&gt; shards are unassigned. Data is fully available, but redundancy is reduced. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_cluster_health_status{color&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;yellow&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 30m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
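&lt;p&gt;To gauge how many shards are affected, a sketch that assumes the exporter also exposes an &lt;code&gt;elasticsearch_cluster_health_unassigned_shards&lt;/code&gt; gauge. This alert does not reference that metric, so verify the name against your exporter before relying on it. While the cluster is YELLOW, every unassigned shard it counts is a replica.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# Assumed metric name: unassigned shards per cluster.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_cluster_health_unassigned_shards&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;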
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;No outage. Reads and writes continue to work. The risk is reduced fault tolerance: if a node holding a primary whose replica is unassigned fails now, the cluster could go RED because there is no replica to promote. Performance for read-heavy indices may also drop while replicas are missing.&lt;/p&gt;</description></item><item><title>ElasticsearchDiskWatermark</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchdiskwatermark/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchdiskwatermark/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Free disk space on at least one Elasticsearch data node has dropped below 15%, approaching the &lt;strong&gt;flood-stage&lt;/strong&gt; watermark (default 95% used). At flood stage Elasticsearch makes indices on the affected node &lt;strong&gt;read-only&lt;/strong&gt; to protect the disk. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;min&lt;/span&gt;&lt;span style="color:#f92672"&gt;(&lt;/span&gt;elasticsearch_filesystem_data_available_bytes &lt;span style="color:#f92672"&gt;/&lt;/span&gt; elasticsearch_filesystem_data_size_bytes&lt;span style="color:#f92672"&gt;)&lt;/span&gt; &lt;span style="color:#f92672"&gt;&amp;lt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0.15&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
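&lt;p&gt;Because &lt;code&gt;min()&lt;/code&gt; aggregates away the per-series labels, the alert itself does not say which node is low on disk. A sketch for finding it: evaluate the same ratio without the aggregation and sort ascending. Which label identifies the node (for example &lt;code&gt;name&lt;/code&gt;) depends on the exporter and is an assumption here.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# Free-space ratio per series, lowest first; node-identifying labels are assumed to come from the exporter.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;sort(elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes)&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;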
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As watermarks are crossed, ES first stops allocating new shards to the node (low watermark, default 85% used, which can cause YELLOW), then tries to relocate shards away from it (high watermark, default 90% used). At flood stage it applies the &lt;code&gt;index.blocks.read_only_allow_delete&lt;/code&gt; block: &lt;strong&gt;writes to affected indices fail&lt;/strong&gt; while reads continue. Logging/observability ingestion and application indices stop accepting new data until disk is freed and the block is cleared.&lt;/p&gt;</description></item><item><title>ElasticsearchHeapHigh</title><link>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchheaphigh/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://runbooks.safetywing.dev/runbooks/elasticsearch/elasticsearchheaphigh/</guid><description>&lt;h2 id="meaning"&gt;Meaning&lt;a class="anchor" href="#meaning"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;JVM heap usage on an Elasticsearch node has been above 90% of max heap for a sustained period. The &lt;code&gt;name&lt;/code&gt; label identifies the node. Persistent heap pressure causes frequent/long GC pauses, slow responses, and can destabilize or OOM the node. This is a cluster-wide platform alert and carries no &lt;code&gt;environment&lt;/code&gt; label, only &lt;code&gt;cluster&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fires when:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;elasticsearch_jvm_memory_used_bytes{area&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;heap&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;/&lt;/span&gt; elasticsearch_jvm_memory_max_bytes{area&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&amp;#34;&lt;span style="color:#e6db74"&gt;heap&lt;/span&gt;&amp;#34;} &lt;span style="color:#f92672"&gt;&amp;gt;&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0.90&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;for: 15m&lt;/code&gt;, severity &lt;code&gt;ticket&lt;/code&gt;, tier &lt;code&gt;platform&lt;/code&gt;.&lt;/p&gt;
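&lt;p&gt;To compare the alerting node with its peers, a sketch that ranks nodes by the same ratio without the threshold. The choice of 3 is arbitrary.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;# The three series with the highest heap ratio; k=3 is an illustrative choice.
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;topk(3, elasticsearch_jvm_memory_used_bytes{area=&amp;#34;heap&amp;#34;} / elasticsearch_jvm_memory_max_bytes{area=&amp;#34;heap&amp;#34;})&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;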
&lt;h2 id="impact"&gt;Impact&lt;a class="anchor" href="#impact"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Degraded performance across the cluster: increased query/index latency, GC stop-the-world pauses, and risk of the affected node dropping out (which can cascade to YELLOW/RED). No immediate data loss as long as the problem is limited to heap pressure.&lt;/p&gt;</description></item></channel></rss>