Meaning
JVM heap usage on an Elasticsearch node has stayed above 90% of max heap for at least 15 minutes. The name label identifies the node. Persistent heap pressure causes frequent or long GC pauses and slow responses, and can destabilize the node or push it into an OutOfMemoryError. This is a cluster-wide platform alert and carries no environment label, only cluster.
Fires when:
elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.90
for: 15m, severity ticket, tier platform.
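To sanity-check the condition by hand, the same used/max ratio can be recomputed from raw byte counts. The sketch below is illustrative only: the input format (one node per line as "name used_bytes max_bytes"), the function name nodes_over_threshold, and the node names are assumptions, not part of the alert definition.

```shell
#!/bin/sh
# Sketch: re-evaluate the alert condition (used / max > 0.90) offline.
# Assumed input format (hypothetical): one node per line,
#   <name> <heap_used_bytes> <heap_max_bytes>

# nodes_over_threshold [threshold]
# Reads node lines on stdin and prints "<name> <ratio>" for every node
# whose used/max ratio is strictly above the threshold (default 0.90).
nodes_over_threshold() {
  awk -v t="${1:-0.90}" '
    NF >= 3 && $3 > 0 {
      ratio = $2 / $3
      if (ratio > t) printf "%s %.2f\n", $1, ratio
    }
  '
}

# Example with made-up numbers (8 GiB max heap on both nodes).
printf '%s\n' \
  "es-data-0 8160437862 8589934592" \
  "es-data-1 4294967296 8589934592" |
  nodes_over_threshold 0.90
# prints: es-data-0 0.95
```

The comparison is strict, matching the `>` in the alert expression, so a node sitting at exactly 90% is not reported.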
Impact
Degraded performance across the cluster: increased query/index latency, GC stop-the-world pauses, and risk of the affected node dropping out (which can cascade to YELLOW/RED). No immediate data loss while the alert is just heap pressure.
Diagnosis
ES REST API (key in Vault kv/global/elasticsearch, exposed as ES_URL/ES_API_KEY):
# Per-node heap %
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max,ram.percent,cpu,load_1m"
# Where heap is going: fielddata, segments, query cache, etc.
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_nodes/stats/jvm,indices?pretty"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/fielddata?v&s=size:desc"
# Expensive / long-running tasks
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_tasks?detailed&actions=*search*&pretty"
# Largest indices by store size
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/indices?v&s=store.size:desc"
Cluster / operator side (kubectl config use-context hetzner):
kubectl get elasticsearch -n elastic -o yaml | grep -A30 nodeSets # heap / resources
kubectl get pods -n elastic -l common.k8s.elastic.co/type=elasticsearch
kubectl top pods -n elastic
kubectl logs -n elastic <es-pod> --tail=200 | grep -i 'gc\|OutOfMemory'
PromQL:
elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}
rate(elasticsearch_jvm_gc_collection_seconds_sum[5m])
Mitigation
- Identify the hot node (name label / _cat/nodes) and what is consuming heap: large fielddata, many shards/segments, or expensive aggregations.
- Reduce load: throttle or stop heavy/expensive queries (aggregations on high-cardinality text fields, deep pagination). Cancel runaway tasks via _tasks/<id>/_cancel.
- Clear fielddata pressure if that is the cause: POST $ES_URL/<index>/_cache/clear?fielddata=true, and avoid sorting/aggregating on analyzed text fields.
- Reduce shard count: too many shards per node inflates heap. Use ILM rollover and shrink/delete old indices.
- If load is legitimate, scale up: increase JVM heap and container memory, or add data nodes, by editing the nodeSets in the Elasticsearch CR. ECK performs a rolling change. Keep heap ≤ 50% of container memory and under ~31 GB (compressed oops).
- Review ILM so index/shard growth does not push heap back up.
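The heap sizing rule in the scale-up step (at most 50% of container memory, and under roughly 31 GB) can be worked through with a small helper before editing the CR. This is a minimal sketch: the function name recommended_heap_gib is hypothetical, and it assumes whole-GiB inputs and treats the ~31 GB ceiling as 31 GiB.

```shell
#!/bin/sh
# Sketch of the sizing rule: heap <= 50% of container memory, capped at 31.
# Assumptions: sizes are whole GiB, and the "~31 GB" ceiling is taken as 31 GiB.

# recommended_heap_gib <container_memory_gib>
# Prints the largest whole-GiB heap that satisfies both limits.
recommended_heap_gib() {
  half=$(( $1 / 2 ))   # integer division rounds down
  cap=31
  if [ "$half" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$half"
  fi
}

recommended_heap_gib 16    # prints 8
recommended_heap_gib 128   # prints 31
```

For a 16 GiB container the 50% rule is the binding limit; for a 128 GiB container the compressed-oops ceiling wins, which is usually the point where adding data nodes is a better use of memory than a larger heap.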
References
- ECK operator docs: https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html
- ECK managing compute resources / JVM heap: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-managing-compute-resources.html
- High JVM memory pressure: https://www.elastic.co/guide/en/elasticsearch/reference/current/high-jvm-memory-pressure.html