Elasticsearch on SafetyWing Runbooks

ElasticsearchClusterRed

Meaning

The Elasticsearch cluster health is RED: at least one primary shard is unassigned, so part of the index data is unavailable and writes to affected indices fail. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

elasticsearch_cluster_health_status{color="red"} == 1

for: 5m, severity page, tier platform.
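When this alert fires, the first question is usually which primaries are unassigned. The following is a minimal sketch, assuming the shard list has already been fetched and parsed from `GET _cat/shards?format=json` (rows carrying `index`, `shard`, `prirep`, and `state`). The function name and the sample rows are illustrative and not part of any existing tooling.

```python
def unassigned_primaries(rows):
    """Return sorted (index, shard) pairs for primary shards that are UNASSIGNED.

    `rows` is assumed to be the parsed JSON of `GET _cat/shards?format=json`,
    where `prirep` is "p" for a primary and "r" for a replica.
    """
    return sorted(
        (row["index"], int(row["shard"]))
        for row in rows
        if row["prirep"] == "p" and row["state"] == "UNASSIGNED"
    )


# Illustrative sample rows (made up, not real cluster output).
sample_rows = [
    {"index": "sw_user", "shard": "0", "prirep": "p", "state": "UNASSIGNED"},
    {"index": "sw_user", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    {"index": "sw_company", "shard": "0", "prirep": "p", "state": "STARTED"},
]

print(unassigned_primaries(sample_rows))
```

Only the unassigned primary is reported; the unassigned replica of the same shard does not make the cluster RED on its own.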

Impact

Search and indexing for the affected indices are down. This includes logging/observability data streams and application search indices (e.g. sw_user, sw_company, sw_company_member). Any service that reads or writes those indices will see errors or empty results until the primaries are reassigned.

ElasticsearchClusterYellow

Meaning

The Elasticsearch cluster health is YELLOW: all primary shards are assigned, but one or more replica shards are unassigned. Data is fully available, but redundancy is reduced. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

elasticsearch_cluster_health_status{color="yellow"} == 1

for: 30m, severity ticket, tier platform.
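To see where redundancy is actually missing, it can help to list the shards whose primary currently has no started replica. This is a rough sketch under the same assumption as a parsed `GET _cat/shards?format=json` response; the function name and sample rows are hypothetical. Note that it also reports indices deliberately configured with zero replicas.

```python
def shards_without_live_replica(rows):
    """Return sorted (index, shard) pairs whose primary has no STARTED replica.

    `rows` is assumed to be the parsed JSON of `GET _cat/shards?format=json`.
    Shards of indices configured with zero replicas are reported as well.
    """
    primaries = set()
    started_replicas = {}
    for row in rows:
        key = (row["index"], int(row["shard"]))
        if row["prirep"] == "p":
            primaries.add(key)
        elif row["state"] == "STARTED":
            started_replicas[key] = started_replicas.get(key, 0) + 1
    return sorted(key for key in primaries if started_replicas.get(key, 0) == 0)


# Illustrative sample rows (made up, not real cluster output).
yellow_rows = [
    {"index": "sw_company", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "sw_company", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    {"index": "sw_user", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "sw_user", "shard": "0", "prirep": "r", "state": "STARTED"},
]

print(shards_without_live_replica(yellow_rows))
```

Each shard in the result is one node failure away from being an unassigned primary.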

Impact

No outage. Reads and writes continue to work. The risk is reduced fault tolerance: if a node holding a primary whose replica is unassigned fails, the cluster could go RED because there is no replica to promote. Performance for read-heavy indices may also drop while replicas are missing.

ElasticsearchDiskWatermark

Meaning

Free disk space on at least one Elasticsearch data node has dropped below 15%. That corresponds to the default low watermark (85% used); the high watermark (default 90% used) and the flood-stage watermark (default 95% used) come next. At flood stage Elasticsearch makes every index with a shard on the affected node read-only to protect the disk. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

min(elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes) < 0.15

for: 15m, severity ticket, tier platform.
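The arithmetic behind the threshold can be sketched as follows. The example assumes the cluster still uses Elasticsearch's default disk watermarks (85%, 90%, and 95% used, i.e. 15%, 10%, and 5% free); if `cluster.routing.allocation.disk.watermark.*` has been overridden, the constants need adjusting. The function name and the byte values are illustrative.

```python
# Default Elasticsearch disk watermarks, expressed as the free fraction
# remaining on the data path (assumes the defaults have not been overridden).
LOW_FREE = 0.15    # low watermark: 85% used
HIGH_FREE = 0.10   # high watermark: 90% used
FLOOD_FREE = 0.05  # flood stage: 95% used


def disk_stage(available_bytes, size_bytes):
    """Classify a node by the same free-space ratio the alert expression uses."""
    free = available_bytes / size_bytes
    if free < FLOOD_FREE:
        return "flood_stage"
    if free < HIGH_FREE:
        return "high"
    if free < LOW_FREE:
        return "low"
    return "ok"


GB = 1024 ** 3
for available_gb in (300, 120, 80, 40):
    # Hypothetical 1000 GB data volume.
    print(available_gb, disk_stage(available_gb * GB, 1000 * GB))
```

Any result other than "ok" means the free ratio is below 0.15, which is the condition this alert evaluates.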

Impact

As watermarks are crossed, ES first stops allocating new shards to the node (low watermark, can cause YELLOW), then tries to relocate shards off it (high watermark), and at flood stage applies the index.blocks.read_only_allow_delete block: writes to affected indices fail while reads continue. Logging/observability ingestion and application indices stop accepting new data until disk is freed and the block is cleared.
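Clearing the block means resetting `index.blocks.read_only_allow_delete` to `null` through the index settings API, after disk space has been freed. The sketch below only builds that request and does not send it; the base URL is a placeholder and the helper name is hypothetical. Recent Elasticsearch versions release the block automatically once usage falls back below the high watermark, so this step may be unnecessary.

```python
import json
import urllib.request


def build_clear_block_request(base_url, index):
    """Build (but do not send) a PUT <index>/_settings request that resets
    the read_only_allow_delete block on the given index."""
    body = json.dumps({"index.blocks.read_only_allow_delete": None}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder URL: substitute the real cluster endpoint and authentication.
request = build_clear_block_request("http://localhost:9200", "sw_user")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen(request)`) is left out on purpose, since the endpoint and credentials depend on the environment.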

ElasticsearchHeapHigh

Meaning

JVM heap usage on an Elasticsearch node has been above 90% of max heap for a sustained period. The name label identifies the node. Persistent heap pressure causes frequent/long GC pauses, slow responses, and can destabilize or OOM the node. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.90

for: 15m, severity ticket, tier platform.
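The combination of the 0.90 threshold and `for: 15m` means a short spike does not alert; the ratio has to stay above the threshold continuously. This is a simplified model of that behaviour, not how Prometheus itself is implemented: it assumes time-ordered `(timestamp_seconds, heap_ratio)` samples, and the function name is hypothetical.

```python
def sustained_breach(samples, threshold=0.90, for_seconds=15 * 60):
    """Return True if the latest unbroken run of samples above `threshold`
    spans at least `for_seconds`.

    `samples` is a time-ordered list of (timestamp_seconds, heap_ratio) pairs,
    where heap_ratio = heap used bytes / heap max bytes for one node.
    """
    breach_start = None
    for timestamp, ratio in samples:
        if ratio > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold resets the pending window.
            breach_start = None
    return breach_start is not None and samples[-1][0] - breach_start >= for_seconds


# One illustrative sample per minute for 16 minutes.
steady = [(minute * 60, 0.93) for minute in range(17)]
with_dip = [(ts, 0.80 if ts == 600 else ratio) for ts, ratio in steady]

print(sustained_breach(steady))
print(sustained_breach(with_dip))
```

The first series stays above 90% for the whole window and would alert; the second dips once at the ten-minute mark, which restarts the 15-minute clock.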

Impact

Degraded performance across the cluster: increased query/index latency, GC stop-the-world pauses, and risk of the affected node dropping out (which can cascade to YELLOW/RED). Heap pressure on its own does not cause data loss.