Meaning#
The Elasticsearch cluster health is RED: at least one primary shard is unassigned, so part of the index data is unavailable and writes to affected indices fail. This is a cluster-wide platform alert and carries no environment label, only cluster.
Fires when:
elasticsearch_cluster_health_status{color="red"} == 1
for: 5m, severity page, tier platform.
Impact#
Search and indexing for the affected indices are down. This includes logging/observability data streams and application search indices (e.g. sw_user, sw_company, sw_company_member). Any service that reads or writes those indices will see errors or empty results until the primaries are reassigned.
Diagnosis#
ES REST API (key in Vault kv/global/elasticsearch, exposed as ES_URL/ES_API_KEY):
# Overall health and which indices are red
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/health?pretty"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/health?level=indices&pretty"
# Unassigned shards (look for p = primary in the prirep column)
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/shards?v" | grep -i UNASSIGNED
# Why is a shard unassigned? (With no request body, this explains an arbitrary unassigned shard, primary or replica)
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/allocation/explain?pretty"
# Node health and disk
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/nodes?v&h=name,heap.percent,disk.used_percent,master"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/indices?v" | grep -i red
Cluster / operator side (kubectl config use-context hetzner):
kubectl get elasticsearch -n elastic
kubectl get pods -n elastic -l common.k8s.elastic.co/type=elasticsearch
kubectl describe elasticsearch -n elastic
kubectl logs -n elastic <es-pod> --tail=200
kubectl get pvc -n elastic
kubectl exec -n elastic <es-pod> -- df -h /usr/share/elasticsearch/data
PromQL:
elasticsearch_cluster_health_status{color="red"} == 1
elasticsearch_cluster_health_unassigned_shards
min(elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes)
Mitigation#
- Run _cluster/allocation/explain to identify the unassigned primary and the reason. The most common causes are a downed node, disk past the flood-stage watermark, or a corrupted shard.
- If a node is down or NotReady, recover it. Check ES pod status and logs; a crash-looping pod (OOM, disk full) blocks allocation. Let ECK reschedule the pod, or run kubectl delete pod -n elastic <es-pod> to force a restart once the underlying cause is fixed.
- If disk is the cause (node past flood-stage), free space or expand the PVC; see ElasticsearchDiskWatermark. Allocation resumes once the node drops below the watermark.
- If a primary is genuinely lost (no replica, node gone for good), the data may need to be restored from snapshot. Only as a last resort, allocate a stale/empty primary via _cluster/reroute (allocate_stale_primary/allocate_empty_primary); this accepts data loss.
- After primaries are assigned, health returns to YELLOW (replicas re-allocating) then GREEN. Watch with _cluster/health.
- If affected indices are growing unbounded, review ILM so this does not recur (rollover/delete policies on the data streams).
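The explain and reroute steps above can be sketched as a few shell helpers. This is a minimal sketch, not part of the original runbook: the function names are hypothetical, the column selection passed to _cat/shards is an assumption, and sw_user and the node name are only example values. The helpers only filter text and print request bodies; nothing touches the cluster until you pipe them into curl yourself.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from the runbook.

# Reads _cat/shards rows on stdin, assuming the columns
# index, shard, prirep, state, unassigned.reason (in that order),
# and prints "index shard reason" for every unassigned primary.
unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { print $1, $2, $5 }'
}

# Prints an allocation explain request body for one primary shard.
# $1 = index name, $2 = shard number
explain_body() {
  printf '{"index":"%s","shard":%d,"primary":true}\n' "$1" "$2"
}

# Prints a reroute body using allocate_stale_primary.
# LAST RESORT: accept_data_loss=true means exactly that.
# $1 = index name, $2 = shard number, $3 = target node name
reroute_stale_body() {
  printf '{"commands":[{"allocate_stale_primary":{"index":"%s","shard":%d,"node":"%s","accept_data_loss":true}}]}\n' "$1" "$2" "$3"
}

# Example usage (commented out so sourcing this file has no side effects):
#
# curl -s -H "Authorization: ApiKey $ES_API_KEY" \
#   "$ES_URL/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | unassigned_primaries
#
# explain_body sw_user 0 | curl -s -H "Authorization: ApiKey $ES_API_KEY" \
#   -H "Content-Type: application/json" \
#   "$ES_URL/_cluster/allocation/explain?pretty" -d @-
```

Printing the reroute body rather than sending it is deliberate: it leaves a chance to review the index, shard, and node before anything that accepts data loss is posted to _cluster/reroute.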
References#
- ECK operator docs: https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html
- Cluster health API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
- Fix a red or yellow cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/red-yellow-cluster-status.html
- Allocation explain API: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html