MysqlReplicationLagHigh • SafetyWing Runbooks

Meaning#

A MySQL replica is applying the primary’s binlog stream more slowly than the primary produces it, so its replication lag (exported as mysql_slave_status_seconds_behind_master) has exceeded the threshold.

Fires when:

```
max by (pod) (
  mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"}
) > <seconds>
```

The condition must hold for 10 minutes (for: 10m); severity is ticket, tier is component.

Impact#

Reads served by the lagging replica return stale data.
A failover to a lagging replica could lose recent writes (MOCO uses semi-sync, which bounds but does not eliminate this risk under degraded conditions).

Diagnosis#

```
kubectl config use-context hetzner
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl moco status -n safetywing-<env>-infra <cluster>

# Replication status on the lagging replica
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin --index <n> <cluster> -- \
  -e "SHOW REPLICA STATUS\G"

# Container logs of the replica (mysqld + agent)
kubectl logs -n safetywing-<env>-infra <replica-pod> -c mysqld --tail=200
kubectl logs -n safetywing-<env>-infra <replica-pod> -c agent --tail=200

# Long-running transactions on the primary that bloat the binlog
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
  -e "SELECT * FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10;"
```

```
# Lag per pod
max by (pod) (mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"})
```

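In SHOW REPLICA STATUS, check Replica_IO_Running / Replica_SQL_Running (both should be Yes), Last_Error, and Seconds_Behind_Source (the field that the legacy SHOW SLAVE STATUS output and the exporter metric call Seconds_Behind_Master).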
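
To make that check quicker to read, the following is a minimal sketch of a filter that trims the `\G` output down to the fields named above and returns non-zero when either replication thread is not `Yes`. The function name `replica_summary` is hypothetical and not part of MOCO or MySQL; it accepts both the current and the legacy lag field name.

```shell
# replica_summary: hypothetical helper (not part of MOCO or MySQL).
# Reads "SHOW REPLICA STATUS\G" output on stdin, prints the key fields,
# and returns 1 if Replica_IO_Running or Replica_SQL_Running is not "Yes".
#
# Usage (needs cluster access):
#   kubectl moco mysql -n safetywing-<env>-infra -u moco-admin --index <n> <cluster> -- \
#     -e "SHOW REPLICA STATUS\G" | replica_summary
replica_summary() {
  awk '
    BEGIN { bad = 0 }
    {
      line = $0
      sub(/^[ \t]+/, "", line)          # drop the left padding of \G output
      pos = index(line, ":")
      if (pos == 0) next                # row separator lines have no colon
      name = substr(line, 1, pos - 1)
      value = substr(line, pos + 1)
      sub(/^[ \t]+/, "", value)
      if (name ~ /^(Replica_IO_Running|Replica_SQL_Running|Last_Error|Last_IO_Error|Last_SQL_Error|Seconds_Behind_Source|Seconds_Behind_Master)$/)
        print name ": " value
      if (name ~ /^Replica_(IO|SQL)_Running$/ && value != "Yes")
        bad = 1                         # a stopped or connecting thread
    }
    END { exit bad }
  '
}
```

A non-zero exit only says that a thread is not running; the Last_Error lines it prints still need to be read to decide which mitigation applies.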

Mitigation#

Broken replication (Replica_SQL_Running: No with a non-empty Last_Error): inspect the error first. After a recoverable conflict MOCO normally re-clones the replica on its own; if it does not, let MOCO reinitialize the replica by deleting its pod (and its PVC, when the existing data must be discarded) so that it re-bootstraps from the primary:
```
kubectl delete pod -n safetywing-<env>-infra <replica-pod>
```
IO thread stalled / network: confirm the replica can reach the primary; check node/network saturation between Hetzner nodes.
Apply backlog from heavy writes: identify the write source (large batch jobs, migrations) and throttle or reschedule it; lag should drain once write volume drops.
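Long transactions on the primary: a transaction is written to the binlog only when it commits, so a large one reaches the replica as a single unit and lag jumps while it is applied. End the offending transaction (coordinate with the owning service).
If a replica persistently fails to catch up and its data is corrupt, remove it and let MOCO re-provision it from the primary.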
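
The reinitialization steps above mention deleting the PVC as well as the pod. A minimal sketch of that, assuming the replica's data volume is the PVC referenced in its pod spec and that MOCO re-clones a replica that starts with an empty volume: the helper name `reinit_replica` is hypothetical, and the PVC name is read from the pod rather than guessed. This discards the replica's data, so confirm with `kubectl moco status` that the pod is not the current primary before running it.

```shell
# reinit_replica: hypothetical helper, not a MOCO command.
# DESTRUCTIVE: deletes the data PVC(s) of one replica pod, then the pod itself.
# Usage: reinit_replica <namespace> <replica-pod>
reinit_replica() {
  ns="$1"
  pod="$2"
  if [ -z "$ns" ] || [ -z "$pod" ]; then
    echo "usage: reinit_replica <namespace> <replica-pod>" >&2
    return 2
  fi
  # Read the PVC name(s) from the pod spec instead of assuming a naming scheme.
  pvcs=$(kubectl get pod -n "$ns" "$pod" \
    -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}') || return 1
  if [ -z "$pvcs" ]; then
    echo "no PVC found on pod $pod" >&2
    return 1
  fi
  # --wait=false: an in-use PVC stays in Terminating until its pod is gone,
  # so do not block on the PVC deletion before deleting the pod.
  for pvc in $pvcs; do
    kubectl delete pvc -n "$ns" "$pvc" --wait=false || return 1
  done
  kubectl delete pod -n "$ns" "$pod"
}
```

After the pod is recreated, `kubectl moco status` should show the replica cloning and then catching up; if it does not, check the agent container logs.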
