ElasticsearchClusterYellow • SafetyWing Runbooks

Meaning

The Elasticsearch cluster health is YELLOW: all primary shards are assigned, but one or more replica shards are unassigned. Data is fully available, but redundancy is reduced. This is a cluster-wide platform alert, so it carries a cluster label but no environment label.

Fires when:

elasticsearch_cluster_health_status{color="yellow"} == 1

The condition must hold for 30 minutes (for: 30m) before the alert fires. Severity: ticket. Tier: platform.

Impact

No outage. Reads and writes continue to work. The risk is reduced fault tolerance: if a node holding a primary whose replica is unassigned fails now, that shard has no replica to promote and the cluster goes RED. Search performance for read-heavy indices may also drop while replicas are missing, because queries cannot be spread across replica copies.

Diagnosis

ES REST API (key in Vault kv/global/elasticsearch, exposed as ES_URL/ES_API_KEY):

curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/health?pretty"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/health?level=indices&pretty"

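# Unassigned replicas (prirep column = r); unassigned.reason shows why the shard became unassigned
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep -i UNASSIGNED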

# Why a shard is unassigned (with no request body, ES explains an arbitrary unassigned shard)
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/allocation/explain?pretty"

# Per-node heap and disk usage
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/nodes?v&h=name,heap.percent,disk.used_percent"
# Per-index health, replica count, and size
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/indices?v"
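
To investigate one specific replica rather than whichever shard ES happens to pick, the allocation explain API also accepts a request body naming the index, the shard number, and whether the copy is a primary. The following is a minimal sketch under assumptions: the helper names (unassigned_replicas, explain_body, explain_replica) are hypothetical and not part of any existing tooling, and the index name in the usage lines is a placeholder.

```shell
# Hypothetical helpers for illustration; the names are not part of existing tooling.

# Reduce `_cat/shards` rows (columns: index shard prirep state unassigned.reason)
# to unassigned replicas only. Reads the table from stdin, prints "index shard reason".
unassigned_replicas() {
  awk '$3 == "r" && $4 == "UNASSIGNED" { print $1, $2, $5 }'
}

# Build the allocation explain request body for one replica shard.
explain_body() {
  # $1 = index name, $2 = shard number
  printf '{"index":"%s","shard":%d,"primary":false}\n' "$1" "$2"
}

# Ask ES why that specific replica is unassigned.
explain_replica() {
  curl -s -H "Authorization: ApiKey $ES_API_KEY" \
    -H "Content-Type: application/json" \
    -X POST "$ES_URL/_cluster/allocation/explain?pretty" \
    -d "$(explain_body "$1" "$2")"
}

# Usage (not run automatically; my-index is a placeholder):
#   curl -s -H "Authorization: ApiKey $ES_API_KEY" \
#     "$ES_URL/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | unassigned_replicas
#   explain_replica my-index 0
```

The `_cat/shards` call in the usage lines omits `v` so that no header row reaches the filter.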

Cluster / operator side (kubectl config use-context hetzner):

kubectl get elasticsearch -n elastic
kubectl get pods -n elastic -l common.k8s.elastic.co/type=elasticsearch
kubectl get pvc -n elastic

PromQL:

elasticsearch_cluster_health_status{color="yellow"} == 1
elasticsearch_cluster_health_unassigned_shards
elasticsearch_cluster_health_number_of_nodes

Mitigation

YELLOW is usually transient and self-heals as ES re-allocates replicas after a node restart or rolling update. If a node recently restarted, give it a few minutes and re-check _cluster/health.
Confirm that the expected number of data nodes is present (_cat/nodes, elasticsearch_cluster_health_number_of_nodes). A single-node nodeSet cannot allocate replicas at all, because ES never places a replica on the same node as its primary; in that case YELLOW is expected and the replica count for those indices should be set to 0.
If replicas stay unassigned, run _cluster/allocation/explain. Common causes: disk usage past a disk watermark (ES stops allocating replicas to a node above the low watermark, 85% by default, and moves shards off a node above the high watermark, 90% by default; see ElasticsearchDiskWatermark), or index.routing.allocation rules preventing placement.
Free disk or expand PVCs if a node is over a watermark; replicas allocate automatically once usage drops back below the low watermark.
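If an index requests as many replicas as there are data nodes, or more, some replicas can never be placed: every copy of a shard needs its own node, so the maximum is the number of data nodes minus one. Lower the replica count or scale the nodeSets in the Elasticsearch CR.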
Review ILM policies if oversized or old indices are consuming the disk space that replica placement needs.
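
Two of the steps above lend themselves to small helpers: waiting for the cluster to return to GREEN after a restart, and lowering the replica count on an index that asks for more copies than the data nodes can hold. This is a sketch under assumptions, not an established procedure: the helper names are hypothetical, the 5m default timeout is arbitrary, and the index pattern in the usage lines is a placeholder.

```shell
# Hypothetical helpers for illustration; the names are not part of existing tooling.

# Block until the cluster reports GREEN or the ES-side timeout expires.
# The JSON response contains "timed_out": true if GREEN was not reached in time.
wait_for_green() {
  # $1 = timeout in ES time units (assumed default: 5m)
  curl -s -H "Authorization: ApiKey $ES_API_KEY" \
    "$ES_URL/_cluster/health?wait_for_status=green&timeout=${1:-5m}&pretty"
}

# Highest replica count that can be fully allocated on N data nodes:
# each copy of a shard needs its own node, so the limit is N - 1 (never below 0).
max_replicas() {
  echo $(( $1 > 1 ? $1 - 1 : 0 ))
}

# Build the index settings body that sets the replica count.
replicas_body() {
  printf '{"index":{"number_of_replicas":%d}}\n' "$1"
}

# Apply a replica count to an index or index pattern.
set_replicas() {
  # $1 = index name or pattern, $2 = replica count
  curl -s -H "Authorization: ApiKey $ES_API_KEY" \
    -H "Content-Type: application/json" \
    -X PUT "$ES_URL/$1/_settings" \
    -d "$(replicas_body "$2")"
}

# Usage (not run automatically; my-index-* is a placeholder):
#   wait_for_green 5m
#   set_replicas "my-index-*" "$(max_replicas 1)"
```

A settings change made this way affects only existing indices; indices created later take their replica count from whatever index template applies to them.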

References

ECK operator docs: https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html
Cluster health API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
Fix a red or yellow cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/red-yellow-cluster-status.html
Allocation explain API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html