MysqlDiskFillingUp • SafetyWing Runbooks

Meaning

A mysql-data-* PersistentVolumeClaim is running low on free space. If it fills completely, mysqld will fail writes and may refuse to start.

Fires when:

min by (persistentvolumeclaim) (
  kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"mysql-data-.*"}
  / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"mysql-data-.*"}
) < (1 - <ratio>)

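Here <ratio> is the used-space threshold, so the alert fires once the free fraction of a volume drops below 1 - <ratio>. The condition must hold for 15 minutes (for: 15m). Severity: ticket; tier: component.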
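
As a worked check of the expression above, the sketch below evaluates the same comparison for a single volume. The function name pvc_alert_fires and the 0.85 threshold are illustrative assumptions; the real <ratio> is whatever the alert rule is configured with.

```shell
#!/bin/sh
# Sketch: evaluate available/capacity < (1 - ratio) for one PVC.
# pvc_alert_fires and the sample numbers are hypothetical, not part of the alert rule.
pvc_alert_fires() {
  # $1 = available bytes, $2 = capacity bytes, $3 = used-space ratio threshold
  awk -v a="$1" -v c="$2" -v r="$3" 'BEGIN { exit !((a / c) < (1 - r)) }'
}

# 10 GiB free of 100 GiB with an assumed ratio of 0.85: 0.10 < 0.15, so the condition holds.
if pvc_alert_fires 10737418240 107374182400 0.85; then
  echo "condition holds"
else
  echo "condition does not hold"
fi
```

The function returns exit status 0 when the condition holds, so it can be used directly in an if statement.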

Impact

A full data volume causes write errors and can crash the instance (MysqlInstanceDown).
Common culprits: accumulated binary logs, an oversized dataset, or relay logs/temp files on a lagging replica.

Diagnosis

kubectl config use-context hetzner
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl get pvc -n safetywing-<env>-infra

# Inspect on-disk usage from inside the mysqld container
kubectl exec -n safetywing-<env>-infra <pod> -c mysqld -- df -h /var/lib/mysql
kubectl exec -n safetywing-<env>-infra <pod> -c mysqld -- \
  sh -c 'du -sh /var/lib/mysql/* | sort -h | tail -20'

# Binary log inventory
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
  -e "SHOW BINARY LOGS;"

# Largest tables
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
  -e "SELECT table_schema, table_name, ROUND((data_length+index_length)/1024/1024) AS mb FROM information_schema.tables ORDER BY mb DESC LIMIT 20;"

# Fraction free per PVC
min by (persistentvolumeclaim) (
  kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"mysql-data-.*"}
  / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"mysql-data-.*"}
)
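
To see how much of the volume the binary logs account for, the SHOW BINARY LOGS output can be totalled. This is a minimal sketch, assuming tab-separated batch-mode output with the file size in the second column and no header row (mysql -N -B); sum_binlog_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: total the File_size column of SHOW BINARY LOGS.
# Assumes batch-mode output: Log_name <TAB> File_size [<TAB> Encrypted], no header row.
sum_binlog_bytes() {
  awk -F '\t' '{ total += $2 } END { printf "%.0f\n", total }'
}

# Intended use against a cluster (placeholders as in the commands above):
# kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
#   -N -B -e "SHOW BINARY LOGS;" | sum_binlog_bytes

# Offline demonstration with two made-up log entries:
printf 'binlog.000001\t1048576\tNo\nbinlog.000002\t2097152\tNo\n' | sum_binlog_bytes
```

Comparing the total in bytes with the df output shows whether binary logs or table data dominate the volume.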

Mitigation

Expand the PVC (preferred, since the StorageClass supports volume expansion). Increase the storage request in the MySQLCluster volumeClaimTemplates; MOCO and Kubernetes then resize the PVC online:
```
spec:
  volumeClaimTemplates:
    - metadata:
        name: mysql-data
      spec:
        resources:
          requests:
            storage: <larger-size>
```
Apply via GitOps, then confirm the new size with kubectl get pvc -n safetywing-<env>-infra.

Prune binary logs if they dominate usage. Purge only logs that every replica has already applied:
```
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;
```
For a durable fix, lower binlog_expire_logs_seconds in the cluster MySQL config (the MySQL 8.0 default is 2592000 seconds, i.e. 30 days).

Lagging replica: a replica that is behind accumulates relay logs. Clearing the lag (see MysqlReplicationLagHigh) lets them be purged.

Reclaim table space: drop unused data or run OPTIMIZE TABLE on bloated tables. OPTIMIZE TABLE temporarily needs extra space for the rebuilt copy, so expand the volume first if it is nearly full.
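
Before purging binary logs, one way to confirm what the replicas still need is to read Relay_Source_Log_File from SHOW REPLICA STATUS on each replica and keep everything from the oldest such file onward. The sketch below is an illustration under several assumptions: oldest_needed_binlog is a hypothetical helper, the --index values are placeholders for the replica instances, and it assumes MySQL 8.0.22 or later field names and zero-padded binlog file names, so that lexical order matches age.

```shell
#!/bin/sh
# Sketch: find the oldest source binlog any replica is still applying.
# Reads one or more `SHOW REPLICA STATUS` outputs in vertical format on stdin.
oldest_needed_binlog() {
  awk -F ': ' '/Relay_Source_Log_File:/ { print $2 }' | sort | head -n 1
}

# Intended use (replica indexes are placeholders; adjust to the cluster):
# for i in <replica-index> ...; do
#   kubectl moco mysql -n safetywing-<env>-infra -u moco-admin --index "$i" <cluster> -- \
#     -E -e "SHOW REPLICA STATUS;"
# done | oldest_needed_binlog
# Then, on the primary, PURGE BINARY LOGS TO '<that file>'; removes only older logs.

# Offline demonstration with made-up status fragments from two replicas:
printf '  Relay_Source_Log_File: binlog.000042\n  Relay_Source_Log_File: binlog.000039\n' \
  | oldest_needed_binlog
```

Purging up to a named file rather than by timestamp keeps the cut-off tied to what the slowest replica has actually executed.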
