Meaning
A MySQL replica is applying the primary’s binlog stream slower than it is produced, so its Seconds_Behind_Master has exceeded the threshold.
Fires when:
max by (pod) (
mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"}
) > <seconds>
for: 10m, severity ticket, tier component.
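For ad-hoc checks in the Prometheus UI, the alert expression can be rendered for a concrete environment and threshold. The following is a minimal sketch; the function name `lag_alert_expr` and the example values `staging` and `300` are illustrative assumptions, not values taken from the alert definition.

```shell
# Render the alert's PromQL expression for one environment and threshold.
# $1: environment name (fills the <env> placeholder), $2: threshold in seconds.
# Both values are supplied by the caller; nothing here is read from the real rule.
lag_alert_expr() {
  printf 'max by (pod) (mysql_slave_status_seconds_behind_master{namespace="safetywing-%s-infra"}) > %s\n' "$1" "$2"
}

# Example with hypothetical values:
lag_alert_expr staging 300
```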
Impact
- Reads served by the lagging replica return stale data.
- A failover to a lagging replica could lose recent writes (MOCO uses semi-sync, which bounds but does not eliminate this risk under degraded conditions).
Diagnosis
kubectl config use-context hetzner
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl moco status -n safetywing-<env>-infra <cluster>
# Replication status on the lagging replica
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin --index <n> <cluster> -- \
-e "SHOW REPLICA STATUS\G"
# Container logs of the replica (mysqld + agent)
kubectl logs -n safetywing-<env>-infra <replica-pod> -c mysqld --tail=200
kubectl logs -n safetywing-<env>-infra <replica-pod> -c agent --tail=200
# Long-running transactions on the primary that bloat the binlog
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
-e "SELECT * FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10;"
# Lag per pod
max by (pod) (mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"})
In the SHOW REPLICA STATUS output, check Replica_IO_Running and Replica_SQL_Running (both should be Yes), Last_Error, and Seconds_Behind_Source (the name SHOW REPLICA STATUS uses for Seconds_Behind_Master).
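The full SHOW REPLICA STATUS output is long, so it can help to reduce it to the few fields that matter for this alert. The following is a minimal sketch, assuming the MySQL 8.0.22+ field names (including Seconds_Behind_Source for the lag value); the function name `replica_status_summary` is hypothetical.

```shell
# Reduce `SHOW REPLICA STATUS\G` output (read from stdin) to key=value lines
# for the fields relevant to replication lag. Other fields are ignored.
replica_status_summary() {
  awk '
    /^[ \t]*[A-Za-z_]+:/ {
      line = $0
      sub(/^[ \t]+/, "", line)        # \G output right-aligns field names
      pos = index(line, ":")          # first colon separates name from value
      key = substr(line, 1, pos - 1)
      val = substr(line, pos + 1)
      sub(/^[ \t]+/, "", val)
      if (key == "Replica_IO_Running")    print "io_running=" val
      if (key == "Replica_SQL_Running")   print "sql_running=" val
      if (key == "Last_Error")            print "last_error=" val
      if (key == "Seconds_Behind_Source") print "lag_seconds=" val
    }
  '
}

# Usage: pipe the output of the SHOW REPLICA STATUS command above into it, e.g.
#   kubectl moco mysql ... -- -e "SHOW REPLICA STATUS\G" | replica_status_summary
```

Lines are printed in the order the fields appear in the input, and an empty Last_Error yields `last_error=`.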
Mitigation
- Broken replication (Replica_SQL_Running: No with Last_Error): inspect the error. For a recoverable conflict, MOCO normally re-clones; otherwise let MOCO reinitialize the replica (delete the pod/PVC so it re-bootstraps from the primary):
kubectl delete pod -n safetywing-<env>-infra <replica-pod>
- IO thread stalled / network: confirm the replica can reach the primary; check node/network saturation between Hetzner nodes.
- Apply backlog from heavy writes: identify the write source (large batch jobs, migrations) and throttle or reschedule it; lag should drain once write volume drops.
- Long transactions on the primary holding the binlog: end the offending transaction (coordinate with the owning service).
- If a replica is persistently unable to catch up and is corrupt, remove it and let MOCO re-provision it from the primary.
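The branches above can be sketched as a small triage helper that maps replica status values to the matching bullet. This is a simplified illustration, not a substitute for reading Last_Error and the logs; the function name `suggest_mitigation` and the default threshold of 300 seconds are assumptions.

```shell
# Map replica status values to one of the mitigation branches.
# $1: Replica_IO_Running, $2: Replica_SQL_Running, $3: lag in seconds (or NULL),
# $4: lag threshold in seconds (hypothetical default: 300).
suggest_mitigation() {
  io=$1 sql=$2 lag=$3 threshold=${4:-300}
  if [ "$sql" != "Yes" ]; then
    echo "broken-replication: inspect Last_Error, then let MOCO reinitialize the replica"
  elif [ "$io" != "Yes" ]; then
    echo "io-stalled: check replica-to-primary connectivity and node/network saturation"
  else
    case $lag in
      ''|*[!0-9]*)
        echo "unknown: lag value is not numeric ($lag)" ;;
      *)
        if [ "$lag" -gt "$threshold" ]; then
          echo "apply-backlog: look for heavy writes or long transactions on the primary"
        else
          echo "ok: replication threads running and lag within threshold"
        fi ;;
    esac
  fi
}

# Example: SQL thread stopped, so the lag value is not consulted.
suggest_mitigation Yes No NULL
```

The SQL thread is checked first because a stopped applier makes the lag value meaningless, which matches the order of the bullets.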