NodeFilesystemAlmostFull • SafetyWing Runbooks

Meaning

A node filesystem is running low on free space. This is a supplemental platform-tier rule on top of the kube-prometheus-stack node-exporter mixin.

Fires when: available space on a non-ephemeral filesystem drops below the configured ratio.

(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
  / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}) < <ratio>

The rule uses for: 15m, severity ticket, and tier platform (cluster-wide, no environment label). The instance and mountpoint labels identify the offending filesystem.
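To make the threshold concrete, the rule's comparison can be reproduced by hand from the two byte counts. The following is a minimal sketch: the helper name `fs_state` is hypothetical, and the 0.10 ratio in the usage note is only an example, since the real `<ratio>` is whatever the rule configures.

```shell
# Hypothetical helper (not part of the alert rule): prints "below <ratio>" when
# avail/size is under the threshold, i.e. when the expression would match,
# and "ok <ratio>" otherwise.
# Usage: fs_state AVAIL_BYTES SIZE_BYTES RATIO
fs_state() {
  awk -v avail="$1" -v size="$2" -v ratio="$3" 'BEGIN {
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    if (size <= 0) { print "unknown"; exit }
    free = avail / size
    printf "%s %.3f\n", (free < ratio ? "below" : "ok"), free
  }'
}
```

For instance, `fs_state 8000000000 100000000000 0.10` describes a filesystem with 8% free, which is under an assumed 10% threshold. The alert itself would still wait for the condition to hold for 15m before firing.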

Impact

A full filesystem on a node can wedge the kubelet, fail image pulls, block container log writes, trigger pod evictions (ephemeral-storage pressure), and, on Talos, disrupt the system partition. If the node hosts stateful workloads (Ceph OSDs, MySQL/MOCO, RabbitMQ), data writes can stall or fail.

Diagnosis

Identify the node and check pressure conditions:

kubectl config use-context hetzner
kubectl get nodes -o wide
kubectl describe node <node>   # look for DiskPressure / ephemeral-storage conditions and taints

Inspect the filesystem directly on Talos (no SSH — use talosctl):

talosctl -n <node-ip> df
talosctl -n <node-ip> usage -d 1 /var      # largest consumers under /var
talosctl -n <node-ip> usage -d 1 /var/lib  # containerd images, kubelet, etcd, etc.
talosctl -n <node-ip> logs kubelet         # eviction / disk-pressure messages
talosctl -n <node-ip> containers -k        # Kubernetes containers on the node (omit -k for Talos system containers)

Find big disk consumers from Kubernetes:

# Pods scheduled on the node
kubectl get pods -A --field-selector spec.nodeName=<node> -o wide
# CPU/memory only (needs metrics-server); kubectl top does not report disk usage
kubectl top pods -A --field-selector spec.nodeName=<node>

# Noisy container logs are a common cause (logs live under /var/log on the node)
kubectl get pods -A --field-selector spec.nodeName=<node> \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
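Per-pod ephemeral-storage usage is not shown by the commands above, but the kubelet Summary API reports it. The sketch below assumes `jq` is installed on the workstation and that your credentials allow access to the node proxy endpoint; the helper name `top_ephemeral` is hypothetical.

```shell
# Hypothetical helper: read a kubelet Summary API document on stdin and print
# "usedBytes namespace/pod" for every pod that reports ephemeral-storage usage,
# largest consumer first. Assumes jq is available.
top_ephemeral() {
  jq -r '.pods[]
         | select(.["ephemeral-storage"].usedBytes != null)
         | "\(.["ephemeral-storage"].usedBytes) \(.podRef.namespace)/\(.podRef.name)"' \
    | sort -rn
}
```

One way to use it is `kubectl get --raw "/api/v1/nodes/<node>/proxy/stats/summary" | top_ephemeral | head`. The figures cover each pod's writable layer, logs, and emptyDir volumes as seen by the kubelet, so they will not account for image storage or system files.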

PromQL to confirm and trend (Prometheus UI: prom-ep.hetzner.safetywing.dev):

# Current free ratio per mountpoint on the node
node_filesystem_avail_bytes{instance="<instance>",fstype!~"tmpfs|overlay|squashfs"}
  / node_filesystem_size_bytes{instance="<instance>",fstype!~"tmpfs|overlay|squashfs"}

# Predicted available bytes 4h from now, extrapolated from the last 6h trend (a value <= 0 means the filesystem is projected to fill within 4h)
predict_linear(node_filesystem_avail_bytes{instance="<instance>",mountpoint="<mountpoint>"}[6h], 4*3600)
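predict_linear returns a projected byte count, not a time. To turn a trend into an estimate of hours until full, divide the current free space by the rate of consumption. The sketch below works through that arithmetic from two samples of available bytes; the helper name `hours_to_full` is hypothetical, and a straight-line extrapolation is only a rough guide when usage is bursty.

```shell
# Hypothetical helper: estimate hours until a filesystem is full by linear
# extrapolation from two samples of available bytes taken WINDOW_HOURS apart.
# Usage: hours_to_full AVAIL_BEFORE AVAIL_NOW WINDOW_HOURS
hours_to_full() {
  awk -v prev="$1" -v cur="$2" -v hours="$3" 'BEGIN {
    rate = (prev - cur) / hours            # bytes consumed per hour
    # Free space is flat or growing, so there is nothing to extrapolate.
    if (rate <= 0) { print "not filling"; exit }
    printf "%.1f\n", cur / rate
  }'
}
```

For example, with 40 GiB free six hours ago and 28 GiB free now, `hours_to_full 42949672960 30064771072 6` works out to about 14 hours at the current rate.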

Mitigation

Confirm which mountpoint is full and whether it is growing or static (predict_linear above).
Reclaim image/log space: check talosctl -n <node-ip> usage -d 1 /var/lib/containerd, let the kubelet image GC prune unused images (it runs automatically under disk pressure), or restart the heaviest log-spamming pods.
Track down a runaway log producer by comparing log output rates across pods (for example, kubectl logs <pod> --since=1m | wc -l), then fix the log level or rotation; oversized container logs are the most common trigger.
For stateful storage (Ceph/MySQL/RabbitMQ), check whether a PVC or OSD is filling the disk — expand the volume or rebalance rather than deleting data.
If the node is in DiskPressure and critical, cordon and drain it to relieve pressure, then resize/clean before uncordoning.
Root causes: log explosion from a misbehaving app, accumulated container images, unbounded ephemeral-storage use, Ceph OSD imbalance, or an undersized node disk that needs growing.
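The cordon-and-drain step above can be sketched as a small wrapper. This is an illustration rather than a prescribed procedure: the function name `relieve_node`, the `KUBECTL` override, and the 10m timeout are assumptions, and the drain flags should be reviewed before use on nodes hosting stateful workloads.

```shell
# Hypothetical wrapper around the cordon/drain step. Set KUBECTL=echo to
# preview the commands without touching the cluster.
relieve_node() {
  node="$1"
  kubectl_bin="${KUBECTL:-kubectl}"
  # Stop new pods from being scheduled onto the node.
  "$kubectl_bin" cordon "$node" || return 1
  # Evict workloads; DaemonSet pods stay in place and emptyDir data is discarded.
  "$kubectl_bin" drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
}
```

Running `KUBECTL=echo relieve_node <node>` prints the two commands for review. PodDisruptionBudgets can block the drain until the timeout expires. Once space has been reclaimed, `kubectl uncordon <node>` returns the node to service.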

References

Talos docs — disk management: https://www.talos.dev/latest/talos-guides/configuration/disk-management/
talosctl reference: https://www.talos.dev/latest/reference/cli/
node-exporter mixin (filesystem alerts): https://github.com/prometheus-operator/kube-prometheus
Kubernetes node disk pressure / eviction: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/