ElasticsearchDiskWatermark • SafetyWing Runbooks

Meaning

Free disk space on at least one Elasticsearch data node has dropped below 15%. That matches the default low watermark (85% used) and leaves little headroom before the flood-stage watermark (default 95% used). At flood stage Elasticsearch makes every index with a shard on the affected node read-only to protect the disk. This is a cluster-wide platform alert: it carries a cluster label but no environment label.

Fires when:

min(elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes) < 0.15

for: 15m, severity ticket, tier platform.
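
To sanity-check a single node against the alert condition by hand, the ratio can be reproduced in shell. This is a minimal sketch, not the alerting rule itself: below_free_threshold is a hypothetical helper, and the byte values in the example are made up.

```shell
# Hypothetical helper: succeeds (exit 0) when avail/size is below the threshold.
# Arguments: bytes available, total bytes, threshold as a fraction.
below_free_threshold() {
  awk -v avail="$1" -v size="$2" -v thr="$3" \
    'BEGIN { exit !(size > 0 && avail / size < thr) }'
}

# Example with made-up numbers: 120 GiB free of 1 TiB is about 11.7% free,
# so the expression is true for this node.
if below_free_threshold 128849018880 1099511627776 0.15; then
  echo "condition true (alert fires once it has held for 15m)"
fi
```

The real alert takes the min() across all nodes, so one node meeting the condition is enough.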

Impact

As watermarks are crossed, ES first stops allocating new shards to the node (low watermark, default 85% used; unassigned replicas can turn the cluster YELLOW), then tries to relocate shards off the node (high watermark, default 90% used), and at flood stage (default 95% used) applies the index.blocks.read_only_allow_delete block: writes to affected indices fail while reads continue. Logging/observability ingestion and application indices stop accepting new data until disk is freed and the block is cleared.
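
The stages can be sketched as a small classifier over a node's used percentage. This assumes the Elasticsearch default watermarks of 85/90/95% used; a cluster that overrides cluster.routing.allocation.disk.watermark.low, .high or .flood_stage needs different numbers. watermark_stage is a hypothetical helper name.

```shell
# Hypothetical helper: map disk.used_percent to the watermark stage reached,
# assuming the default thresholds (85 / 90 / 95 percent used).
watermark_stage() {
  awk -v used="$1" 'BEGIN {
    if      (used > 95) print "flood_stage"
    else if (used > 90) print "high"
    else if (used > 85) print "low"
    else                print "ok"
  }'
}

watermark_stage 87.5   # prints "low"
```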

Diagnosis

ES REST API (key in Vault kv/global/elasticsearch, exposed as ES_URL/ES_API_KEY):

# Per-node disk usage
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/nodes?v&h=name,disk.used_percent,disk.used,disk.avail,disk.total"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/allocation?v"

# Largest indices
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/indices?v&s=store.size:desc"

# Health + any read-only blocks already applied
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cluster/health?pretty"
curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_all/_settings/index.blocks.read_only_allow_delete?pretty"
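
On a cluster with many nodes it can be quicker to filter the _cat/nodes output down to the nodes that are actually over a limit. A minimal sketch, assuming the two-column output of h=name,disk.used_percent without the v header; nodes_over is a hypothetical helper and the sample rows are made up.

```shell
# Hypothetical helper: read "name disk.used_percent" rows on stdin and print
# the nodes whose used percentage exceeds the limit passed as $1.
nodes_over() {
  awk -v limit="$1" '$2 + 0 > limit { print $1, $2 }'
}

# Against the cluster (no ?v, so there is no header row to skip):
#   curl -s -H "Authorization: ApiKey $ES_API_KEY" \
#     "$ES_URL/_cat/nodes?h=name,disk.used_percent" | nodes_over 85

# Offline demonstration with made-up rows:
printf '%s\n' 'es-data-0 91.20' 'es-data-1 62.05' | nodes_over 85
```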

Cluster / operator side (kubectl config use-context hetzner):

kubectl get elasticsearch -n elastic
kubectl get pods -n elastic -l common.k8s.elastic.co/type=elasticsearch
kubectl get pvc -n elastic
kubectl exec -n elastic <es-pod> -- df -h /usr/share/elasticsearch/data

PromQL:

min(elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes)
elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes

Mitigation

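Free space first. Let ILM's delete phase remove old indices and data-stream backing indices, or delete obsolete indices directly: DELETE $ES_URL/<index>. Rolling over on its own does not free disk; it only stops the current write index from growing. Confirm space recovered with _cat/allocation.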
If data is needed but the node is full, expand the PVC: edit the volumeClaimTemplates storage request in the Elasticsearch CR nodeSets (the StorageClass must allow volume expansion); ECK and the CSI driver grow the PVC. Verify with kubectl get pvc -n elastic and df -h in the pod.
Add a data node to the nodeSets if the cluster as a whole is undersized; shards rebalance off the full node.
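
Before deleting indices or resizing a volume, it can help to estimate how much space has to be recovered. The sketch below is only arithmetic: bytes_to_free is a hypothetical helper, the 80% target is an arbitrary example chosen to sit under the default low watermark, and the volume figures are made up.

```shell
# Hypothetical helper: bytes that must be freed so a node with the given
# total and available bytes ends up at or below target_used_pct.
bytes_to_free() {
  awk -v total="$1" -v avail="$2" -v target="$3" 'BEGIN {
    need = (total - avail) - total * target / 100
    if (need < 0) need = 0
    printf "%.0f\n", need
  }'
}

# Example: a 1000 GB volume with 80 GB free, aiming for 80% used.
bytes_to_free 1000000000000 80000000000 80   # prints 120000000000
```

Compare the result with the store.size column of _cat/indices to judge whether deleting a few old indices is enough or the PVC needs to grow.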

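Elasticsearch 7.4 and later releases the read-only block automatically once disk usage falls below the high watermark. If the block persists, or writes need to resume sooner, clear it manually once disk is back below the flood-stage watermark: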

curl -s -X PUT -H "Authorization: ApiKey $ES_API_KEY" -H 'Content-Type: application/json' \
  "$ES_URL/_all/_settings" \
  -d '{"index.blocks.read_only_allow_delete": null}'

Confirm writes work again and health returns to GREEN.
Fix the root cause: tune ILM rollover/delete policies so indices do not grow unbounded and refill the disk.
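
Waiting for health to return to GREEN can be scripted instead of re-running the health call by hand. A minimal sketch in POSIX shell: wait_for_status and es_status are hypothetical helper names, the attempt count and delay are arbitrary, and the variables are not function-local.

```shell
# Hypothetical helper: run the given command until it prints the wanted
# status or the attempts run out. Returns 0 on success, 1 on timeout.
wait_for_status() {
  want=$1 attempts=$2 delay=$3
  shift 3
  while [ "$attempts" -gt 0 ]; do
    status=$("$@" 2>/dev/null | tr -d '[:space:]')
    [ "$status" = "$want" ] && return 0
    attempts=$((attempts - 1))
    [ "$attempts" -gt 0 ] && sleep "$delay"
  done
  return 1
}

# Assumed wrapper around the cat health API, printing only the status column.
es_status() {
  curl -s -H "Authorization: ApiKey $ES_API_KEY" "$ES_URL/_cat/health?h=status"
}

# Usage: poll up to 30 times, 10 seconds apart.
#   wait_for_status green 30 10 es_status && echo "cluster is GREEN"
```

A timeout here does not necessarily mean the disk problem persists; YELLOW can also come from replicas that are still recovering.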

References

ECK operator docs / volume expansion: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-volume-claim-templates.html
Disk-based shard allocation (watermarks): https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#disk-based-shard-allocation
Fix watermark errors: https://www.elastic.co/guide/en/elasticsearch/reference/current/fix-watermark-errors.html
ILM overview: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html