Meaning
A node filesystem is running low on free space. This is a supplemental platform-tier rule on top of the kube-prometheus-stack node-exporter mixin.
Fires when: available space on a non-ephemeral filesystem drops below the configured ratio.
(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}) < <ratio>
for: 15m, severity ticket, tier platform (cluster-wide, no environment label). The offending filesystem is identified by the instance and mountpoint labels.
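To make the firing condition concrete, here is a minimal Python sketch of the same comparison for a single filesystem sample. The function name `is_low_on_space`, the `EXCLUDED_FSTYPES` constant, and the example ratio of 0.10 are illustrative assumptions, not values taken from the rule. The sketch also ignores the 15m hold period that the real alert requires.

```python
# Illustrative only: mirrors the PromQL comparison for one filesystem sample.
# Names and the 0.10 ratio below are hypothetical, not part of the actual rule.
EXCLUDED_FSTYPES = {"tmpfs", "overlay", "squashfs"}  # fstypes the selector filters out


def is_low_on_space(avail_bytes: float, size_bytes: float, fstype: str, ratio: float) -> bool:
    """Return True when avail/size is below `ratio` for a non-excluded filesystem."""
    if fstype in EXCLUDED_FSTYPES:
        return False  # fstype!~"tmpfs|overlay|squashfs" drops these series entirely
    if size_bytes <= 0:
        return False  # in PromQL the division yields NaN or +Inf, which never compares below the ratio
    return (avail_bytes / size_bytes) < ratio


GIB = 1024 ** 3
print(is_low_on_space(8 * GIB, 100 * GIB, "xfs", ratio=0.10))    # True: 0.08 < 0.10
print(is_low_on_space(8 * GIB, 100 * GIB, "tmpfs", ratio=0.10))  # False: excluded fstype
```

Because the rule compares a ratio rather than an absolute byte count, a large disk can still hold many free gigabytes when the alert fires; check the mountpoint size before judging urgency.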
Impact
A full filesystem on a node can wedge the kubelet, fail image pulls, block container log writes, evict pods (ephemeral-storage pressure), and on Talos can disrupt the system partition. If the node hosts stateful workloads (Ceph OSDs, MySQL/MOCO, RabbitMQ), data writes can stall or fail.
Diagnosis
Identify the node and check pressure conditions:
kubectl config use-context hetzner
kubectl get nodes -o wide
kubectl describe node <node> # look for DiskPressure / ephemeral-storage conditions and taints
Inspect the filesystem directly on Talos (no SSH; use talosctl):
talosctl -n <node-ip> df
talosctl -n <node-ip> usage -d 1 /var # largest consumers under /var
talosctl -n <node-ip> usage -d 1 /var/lib # containerd images, kubelet, etcd, etc.
talosctl -n <node-ip> logs kubelet # eviction / disk-pressure messages
talosctl -n <node-ip> containers # running containers on the node (add -k for the Kubernetes namespace)
Find big disk consumers from Kubernetes:
# Pods scheduled on the node (candidates for heavy ephemeral-storage usage)
kubectl get pods -A --field-selector spec.nodeName=<node> -o wide
kubectl top pods -A --field-selector spec.nodeName=<node> # CPU/memory only; if metrics-server available
# Noisy container logs are a common cause (logs live under /var/log on the node)
kubectl get pods -A --field-selector spec.nodeName=<node> \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
PromQL to confirm and trend (Prometheus UI: prom-ep.hetzner.safetywing.dev):
# Current free ratio per mountpoint on the node
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}
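# Predicted available bytes 4h from now, extrapolated from the last 6h trend;
# a value at or below 0 means the filesystem is projected to fill within 4h
predict_linear(node_filesystem_avail_bytes{instance="<instance>",mountpoint="<mountpoint>"}[6h], 4*3600)
Mitigation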
- Confirm which mountpoint is full and whether it is growing or static (predict_linear above).
- Reclaim image/log space: inspect usage with talosctl -n <node-ip> usage -d 1 /var/lib/containerd, then let the kubelet image GC prune unused images (it runs automatically under disk pressure) or restart the heaviest log-spamming pods.
- Track down a runaway log producer (check the output rate of kubectl logs <pod>) and fix the log level or rotation; oversized container logs are the most common trigger.
- For stateful storage (Ceph/MySQL/RabbitMQ), check whether a PVC or OSD is filling the disk; expand the volume or rebalance rather than deleting data.
- If the node is in DiskPressure and critical, cordon and drain it to relieve pressure, then resize/clean before uncordoning.
- Root causes: log explosion from a misbehaving app, accumulated container images, unbounded ephemeral-storage use, Ceph OSD imbalance, or an undersized node disk that needs growing.
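The growing-versus-static check in the first step relies on predict_linear, which fits a least-squares line through the samples in the range and extrapolates it. The following Python sketch illustrates that idea and derives an hours-to-full estimate from the same fit. The function names and sample data are hypothetical, and the sketch treats the last sample's timestamp as the evaluation time, so results may differ slightly from what Prometheus computes.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, available bytes)


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Least-squares fit. Returns (slope in bytes/second, value at the last sample's time)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_ref = samples[-1][0]  # assumption: evaluate at the newest sample
    xs = [t - t_ref for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_avail(samples: Sequence[Sample], horizon_seconds: float) -> float:
    """Extrapolated available bytes `horizon_seconds` after the last sample."""
    slope, intercept = fit_line(samples)
    return intercept + slope * horizon_seconds


def hours_to_full(samples: Sequence[Sample]) -> Optional[float]:
    """Hours until the fitted line reaches zero, or None if free space is not shrinking."""
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None  # static or recovering: no projected fill time
    return max(intercept, 0.0) / -slope / 3600


GIB = 1024 ** 3
# Hypothetical data: one sample per hour for 6h, losing 1 GiB per hour (10 GiB down to 4 GiB).
samples = [(h * 3600.0, (10 - h) * GIB) for h in range(7)]
print(round(predict_avail(samples, 2 * 3600) / GIB, 2))  # 2.0 GiB expected free in 2h
print(round(hours_to_full(samples), 2))                  # 4.0 hours until full
```

A `None` result corresponds to the static case, where cleanup can be scheduled; a small positive number of hours indicates active growth and argues for finding the writer first.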
References
- Talos docs — disk management: https://www.talos.dev/latest/talos-guides/configuration/disk-management/
- talosctl reference: https://www.talos.dev/latest/reference/cli/
- node-exporter mixin (filesystem alerts): https://github.com/prometheus-operator/kube-prometheus
- Kubernetes node disk pressure / eviction: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/