ElasticsearchClusterRed • SafetyWing Runbooks

Meaning

The Elasticsearch cluster health is RED: at least one primary shard is unassigned, so part of the index data is unavailable and writes to affected indices fail. This is a cluster-wide platform alert; it carries only a cluster label, not an environment label.

Fires when:

elasticsearch_cluster_health_status{color="red"} == 1

for: 5m, severity page, tier platform.

Impact

Search and indexing for the affected indices are down. This includes logging/observability data streams and application search indices (e.g. sw_user, sw_company, sw_company_member). Any service that reads or writes those indices will see errors or empty results until the primaries are reassigned.

Diagnosis

ES REST API (key in Vault kv/global/elasticsearch, exposed as ES_URL/ES_API_KEY):

# Overall health and which indices are red
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/health?pretty"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/health?level=indices&pretty"

# Unassigned shards (p in the prirep column = primary); unassigned.reason gives the short cause
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep -i UNASSIGNED

# Why is a shard unassigned? With no request body, this explains an arbitrary unassigned shard (primary or replica)
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/allocation/explain?pretty"

# Node health and disk
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/nodes?v&h=name,heap.percent,disk.used_percent,master"

# Red indices only
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/indices?v&health=red"
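To ask about one particular primary rather than whichever unassigned shard Elasticsearch picks, the allocation explain API accepts a request body naming the index and shard. A minimal sketch, assuming a POSIX-style shell; explain_primary is a hypothetical helper name, and ES_URL/ES_API_KEY are the variables used above:

```shell
# Hypothetical helper: explain why one specific primary shard is unassigned.
# Usage: explain_primary <index> [shard-number, default 0]
explain_primary() {
  curl -s -X POST \
    -H "Authorization: ApiKey $ES_API_KEY" \
    -H "Content-Type: application/json" \
    "$ES_URL/_cluster/allocation/explain?pretty" \
    -d "{\"index\":\"$1\",\"shard\":${2:-0},\"primary\":true}"
}
```

For example, explain_primary sw_user 0 would ask about primary shard 0 of sw_user; take the index name and shard number from the _cat/shards output.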

Cluster / operator side (kubectl config use-context hetzner):

kubectl get elasticsearch -n elastic
kubectl get pods -n elastic -l common.k8s.elastic.co/type=elasticsearch
kubectl describe elasticsearch -n elastic
kubectl logs -n elastic <es-pod> --tail=200
kubectl get pvc -n elastic
kubectl exec -n elastic <es-pod> -- df -h /usr/share/elasticsearch/data
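When it is not yet clear which pod is short on disk, the df check can be repeated across every Elasticsearch pod. A rough sketch using only the namespace, label selector, and data path shown above; es_pods_disk is a hypothetical helper name:

```shell
# Hypothetical helper: print data-volume usage for every Elasticsearch pod.
es_pods_disk() {
  # -o name yields entries like "pod/<name>", which kubectl exec accepts directly
  for pod in $(kubectl get pods -n elastic -l common.k8s.elastic.co/type=elasticsearch -o name); do
    echo "== $pod"
    kubectl exec -n elastic "$pod" -- df -h /usr/share/elasticsearch/data
  done
}
```

Pods that are crash-looping will fail the exec step; for those, kubectl describe and the PVC listing are the better source.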

PromQL:

elasticsearch_cluster_health_status{color="red"} == 1
elasticsearch_cluster_health_unassigned_shards
min(elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes)
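If the Prometheus UI is not at hand, the same expressions can be evaluated from a terminal through the Prometheus HTTP query API. This is a sketch under assumptions: PROM_URL is a hypothetical variable pointing at a reachable Prometheus (for instance a port-forward), and prom_query is a hypothetical helper name; neither is defined elsewhere in this runbook.

```shell
# Hypothetical helper: run one instant query against the Prometheus HTTP API.
# PROM_URL is an assumed variable, e.g. http://localhost:9090 behind a port-forward.
prom_query() {
  # -G sends the url-encoded data as a GET query string
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

For example, prom_query 'elasticsearch_cluster_health_unassigned_shards' returns the current unassigned shard count as JSON.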

Mitigation

Run _cluster/allocation/explain to identify the unassigned primary and the reason. The most common causes are a downed node, disk past the flood-stage watermark, or a corrupted shard.
If a node is down or NotReady, recover it. Check ES pod status and logs; a crash-looping pod (OOM, disk full) blocks allocation. Let ECK reschedule the pod, or run kubectl delete pod -n elastic <es-pod> to force a restart once the underlying cause is fixed.
If disk is the cause (node past flood-stage), free space or expand the PVC (see ElasticsearchDiskWatermark). Allocation resumes once the node drops below the watermark.
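If a primary is genuinely lost (no replica, node gone for good), the data may need to be restored from snapshot. Only as a last resort, allocate a stale or empty primary via _cluster/reroute (allocate_stale_primary / allocate_empty_primary). Both commands require accept_data_loss: true and lose data: a stale primary keeps only what that old copy holds, and an empty primary discards the shard's contents entirely.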
After primaries are assigned, health returns to YELLOW (replicas re-allocating) then GREEN. Watch with _cluster/health.
If affected indices are growing unbounded, review ILM so this does not recur (rollover/delete policies on the data streams).
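The health watch and the last-resort reroute from the steps above can be sketched as two small shell helpers. The function names are hypothetical, ES_URL/ES_API_KEY are the variables from the Diagnosis section, and the index, shard, and node arguments are placeholders to fill in from the allocation explain output. Treat the reroute helper as illustrative; it knowingly accepts data loss.

```shell
# Hypothetical helpers; ES_URL and ES_API_KEY as in the Diagnosis section.

# Block until the cluster reaches at least the given status or the timeout expires.
# Usage: wait_for_status [status, default yellow] [timeout, default 60s]
wait_for_status() {
  curl -s -H "Authorization: ApiKey $ES_API_KEY" \
    "$ES_URL/_cluster/health?wait_for_status=${1:-yellow}&timeout=${2:-60s}&pretty"
}

# LAST RESORT, ACCEPTS DATA LOSS: promote a stale copy of a primary on the named node.
# Usage: allocate_stale_primary <index> <shard-number> <node-name>
allocate_stale_primary() {
  curl -s -X POST \
    -H "Authorization: ApiKey $ES_API_KEY" \
    -H "Content-Type: application/json" \
    "$ES_URL/_cluster/reroute?pretty" \
    -d "{\"commands\":[{\"allocate_stale_primary\":{\"index\":\"$1\",\"shard\":$2,\"node\":\"$3\",\"accept_data_loss\":true}}]}"
}
```

A typical sequence would be wait_for_status yellow 120s after fixing the node or disk, and wait_for_status green once replicas are expected to have caught up. If the wait times out, the health response reports timed_out as true.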

References

ECK operator docs: https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html
Cluster health API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
Fix a red or yellow cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/red-yellow-cluster-status.html
Allocation explain API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html