MysqlInstanceDown • SafetyWing Runbooks

Meaning

A MySQL instance’s mysqld_exporter sidecar reports the server unreachable, so the mysqld process is down or not accepting connections.

Fires when:

max by (pod) (mysql_up{namespace="safetywing-<env>-infra"}) == 0

The alert fires once the expression has held for 5 minutes (for: 5m), with severity page and tier component.
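
To evaluate the alert expression by hand, something like the following may help. It is a minimal sketch: the helper names and the PROM_URL variable are hypothetical, and it assumes a Prometheus server reachable over its standard HTTP query API.

```shell
# Hypothetical helper: render the alert expression for one environment name.
mysql_up_query() {
  printf 'max by (pod) (mysql_up{namespace="safetywing-%s-infra"})' "$1"
}

# Hypothetical helper: run the expression as an instant query.
# PROM_URL is an assumption; point it at whichever Prometheus scrapes the exporters.
mysql_up_now() {
  curl -sS -G "${PROM_URL:?set PROM_URL first}/api/v1/query" \
    --data-urlencode "query=$(mysql_up_query "$1")"
}
```

Calling mysql_up_now with the environment name returns the JSON result of the instant query; any pod whose value is 0 matches the alert condition.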

Impact

The affected instance serves no queries.
If the primary is down, MOCO must fail over before writes can resume; expect a short write outage.
If a replica is down, read capacity and replication redundancy are reduced.
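
Which case applies depends on whether the down pod is the current primary. The sketch below is one way to look that up. It assumes the MySQLCluster resource reports the primary in .status.currentPrimaryIndex and that pods are named moco-<cluster>-<index>; verify both against your MOCO version before relying on it. The function name is hypothetical.

```shell
# Hypothetical helper: print the pod name of the current primary.
# Assumes .status.currentPrimaryIndex exists and pods are named moco-<cluster>-<index>.
primary_pod() {
  pp_ns="$1"
  pp_cluster="$2"
  pp_idx="$(kubectl get mysqlcluster -n "$pp_ns" "$pp_cluster" \
    -o jsonpath='{.status.currentPrimaryIndex}')" || return 1
  # An empty index means the status field was missing or not yet populated.
  [ -n "$pp_idx" ] || return 1
  printf 'moco-%s-%s\n' "$pp_cluster" "$pp_idx"
}
```

If the printed name matches the pod in the alert, treat the incident as a primary outage; otherwise it is a replica outage.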

Diagnosis

kubectl config use-context hetzner

# Cluster + member roles (which pod is primary vs replica)
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl moco status -n safetywing-<env>-infra <cluster>

# Pod state and recent events (OOMKill, evictions, probe failures)
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/name=mysql -o wide
kubectl describe pod -n safetywing-<env>-infra <pod>

# Container logs — mysqld, the MOCO agent, and the exporter
kubectl logs -n safetywing-<env>-infra <pod> -c mysqld --tail=200
kubectl logs -n safetywing-<env>-infra <pod> -c agent --tail=200
kubectl logs -n safetywing-<env>-infra <pod> -c mysqld-exporter --tail=100

# Confirm which pod(s) are down (PromQL, not a shell command; a value of 0 means down)
max by (pod) (mysql_up{namespace="safetywing-<env>-infra"})
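
The describe output can be long. As a quicker check, the sketch below prints each container's restart count and last termination reason, then filters for OOM kills. The function names are hypothetical; the fields come from the standard pod status.

```shell
# Hypothetical helper: one tab-separated line per container:
# name, restart count, last termination reason, last exit code.
last_terminations() {
  kubectl get pod -n "$1" "$2" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Hypothetical helper: read last_terminations output on stdin and
# print the names of containers whose last termination was an OOM kill.
oom_killed_containers() {
  awk -F '\t' '$3 == "OOMKilled" { print $1 }'
}
```

Piping last_terminations into oom_killed_containers lists the containers that were OOM-killed; empty output means the last termination, if any, had another cause.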

Mitigation

Check pod events for the root cause: OOMKilled (raise memory limits in the MySQLCluster .spec.podTemplate), node pressure/eviction, or failed PVC mount.
If disk is full, mysqld will refuse to start — see MysqlDiskFillingUp and expand the PVC first.
If mysqld crashed but the pod itself is otherwise healthy (no OOM kill, eviction, or volume problem), restart it by deleting the pod:
```
kubectl delete pod -n safetywing-<env>-infra <pod>
```
MOCO recreates the pod; verify it rejoins via kubectl moco status.
If the primary is down and not recovering, let MOCO fail over to a healthy replica; confirm a new primary was elected in kubectl moco status. Investigate the old primary before reintroducing it.
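
The restart step can be wrapped so that it blocks until the replacement pod is Ready. This is a sketch under one assumption: the pod is recreated under the same name, as happens for pods owned by a StatefulSet. The function name, retry count, and timeouts are illustrative.

```shell
# Hypothetical helper: delete a pod, wait for it to be recreated, then wait for Ready.
# Assumes the replacement pod reuses the same name.
restart_mysql_pod() {
  rp_ns="$1"
  rp_pod="$2"
  rp_tries="${3:-60}"
  # kubectl delete waits for the old pod object to be gone by default.
  kubectl delete pod -n "$rp_ns" "$rp_pod" || return 1
  # Poll until the replacement pod object exists; give up after rp_tries attempts.
  rp_i=0
  until kubectl get pod -n "$rp_ns" "$rp_pod" >/dev/null 2>&1; do
    rp_i=$((rp_i + 1))
    [ "$rp_i" -ge "$rp_tries" ] && return 1
    sleep 5
  done
  kubectl wait -n "$rp_ns" --for=condition=Ready "pod/$rp_pod" --timeout=300s
}
```

A Ready pod only shows that the containers passed their probes; still confirm in kubectl moco status that the instance has rejoined the cluster.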

If MOCO cannot reconcile, inspect the operator:

kubectl logs -n moco-system deploy/moco-controller-manager --tail=200
