ElasticsearchHeapHigh • SafetyWing Runbooks

Meaning#

JVM heap usage on an Elasticsearch node has stayed above 90% of max heap for at least 15 minutes. The name label identifies the node. Persistent heap pressure causes frequent or long GC pauses and slow responses, and can destabilize the node or crash it with an OutOfMemoryError. This is a cluster-wide platform alert: it carries a cluster label but no environment label.

Fires when:

elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.90

for: 15m, severity ticket, tier platform.
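
As a worked example of the threshold arithmetic, the sketch below evaluates the same used/max ratio for one pair of byte counts. The function name heap_over_threshold and the sample values are illustrative assumptions, not part of the alert rule. The real alert also requires the condition to hold for 15 minutes, which this check does not model.

```shell
# Hypothetical helper: prints the used/max heap ratio and returns 0 (true)
# when the ratio is above the alert threshold of 0.90.
heap_over_threshold() {
  # $1 = used heap bytes, $2 = max heap bytes
  awk -v used="$1" -v max="$2" 'BEGIN {
    ratio = used / max
    printf "%.3f\n", ratio
    exit !(ratio > 0.90)
  }'
}

# Example: 29 GiB used of a 31 GiB heap
heap_over_threshold 31138512896 33285996544   # prints 0.935 (above threshold)
```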

Impact#

Degraded performance across the cluster: increased query/index latency, GC stop-the-world pauses, and risk of the affected node dropping out (which can cascade to YELLOW/RED). Heap pressure on its own causes no immediate data loss.

Diagnosis#

ES REST API (key in Vault kv/global/elasticsearch, exposed as ES_URL/ES_API_KEY):

# Per-node heap %
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent,cpu,load_1m"

# Where heap is going: fielddata, segments, query cache, etc.
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_nodes/stats/jvm,indices?pretty"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/fielddata?v&s=size:desc"

# Expensive / long-running search tasks
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_tasks?detailed&actions=*search*&pretty"

# Largest indices by store size
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/indices?v&s=store.size:desc"
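
The _nodes/stats output is large, so a narrower per-node view can help when comparing nodes. The sketch below wraps a _cat/nodes call that requests only cache and segment columns. The function name es_heap_breakdown is hypothetical, and it assumes ES_URL and ES_API_KEY are exported as described above. The column set is one reasonable choice, not the only one.

```shell
# Hypothetical helper: one row per node with heap-relevant cache columns,
# sorted by heap usage (highest first).
# Assumes ES_URL and ES_API_KEY are set in the environment.
es_heap_breakdown() (
  cols="name,heap.percent,fielddata.memory_size,query_cache.memory_size,request_cache.memory_size,segments.count"
  curl -s -H "Authorization: ApiKey $ES_API_KEY" \
    "$ES_URL/_cat/nodes?v&h=$cols&s=heap.percent:desc"
)
```

Run es_heap_breakdown with no arguments. A node whose fielddata or cache columns are much larger than its peers is a likely source of the heap pressure.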

Cluster / operator side (kubectl config use-context hetzner):

kubectl get elasticsearch -n elastic -o yaml | grep -A30 nodeSets   # heap / resources
kubectl get pods -n elastic -l common.k8s.elastic.co/type=elasticsearch
kubectl top pods -n elastic
kubectl logs -n elastic <es-pod> --tail=200 | grep -iE 'gc|OutOfMemory'

PromQL:

elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}
rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])

Mitigation#

Identify the hot node (name label / _cat/nodes) and what is consuming heap — large fielddata, many shards/segments, or expensive aggregations.
Reduce load: throttle or stop heavy/expensive queries (aggregations on high-cardinality text fields, deep pagination). Cancel runaway tasks with POST $ES_URL/_tasks/<task_id>/_cancel, using a task ID from the _tasks output.
Clear fielddata pressure if that is the cause: POST $ES_URL/<index>/_cache/clear?fielddata=true, and avoid sorting/aggregating on analyzed text fields.
Reduce shard count: too many shards per node inflate heap usage. Use ILM rollover and shrink/delete old indices.
If load is legitimate, scale up: increase JVM heap and container memory, or add data nodes, by editing the nodeSets in the Elasticsearch CR. ECK performs a rolling change. Keep heap ≤ 50% of container memory and under ~31 GB (compressed oops).
Review ILM so index/shard growth does not push heap back up.
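
The task-cancel and fielddata-clear steps above, plus the heap sizing rule, can be sketched as small shell helpers. The function names are hypothetical, and the helpers assume ES_URL and ES_API_KEY are exported as in the Diagnosis section. max_heap_gib only encodes the "half of container memory, capped at 31 GB" guideline from this runbook and rounds down to whole GiB.

```shell
# Hypothetical helpers; assume ES_URL and ES_API_KEY are set in the environment.

# Cancel one task. $1 is a task ID as shown by the _tasks API.
es_cancel_task() {
  curl -s -X POST -H "Authorization: ApiKey $ES_API_KEY" \
    "$ES_URL/_tasks/$1/_cancel"
}

# Clear the fielddata cache for one index. $1 is the index name.
es_clear_fielddata() {
  curl -s -X POST -H "Authorization: ApiKey $ES_API_KEY" \
    "$ES_URL/$1/_cache/clear?fielddata=true"
}

# Heap ceiling in whole GiB for a container memory limit given in GiB:
# half the container memory, capped at 31 GiB (compressed oops).
max_heap_gib() (
  half=$(( $1 / 2 ))
  if [ "$half" -gt 31 ]; then half=31; fi
  echo "$half"
)
```

For example, max_heap_gib 16 prints 8 and max_heap_gib 128 prints 31. Clearing fielddata only buys time: the cache refills if the same queries keep running.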

References#

ECK operator docs: https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html
ECK managing compute resources / JVM heap: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html
High JVM memory pressure: https://www.elastic.co/guide/en/elasticsearch/reference/current/high-jvm-memory-pressure.html