Meaning#

A MySQL replica is applying the primary’s binlog stream more slowly than the primary produces it, so its replication lag (Seconds_Behind_Master, exported as mysql_slave_status_seconds_behind_master) has exceeded the alert threshold.

Fires when:

max by (pod) (
  mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"}
) > <seconds>

The condition must hold for 10 minutes (for: 10m) before the alert fires. Severity: ticket. Tier: component.

Impact#

  • Reads served by the lagging replica return stale data.
  • A failover to a lagging replica could lose recent writes (MOCO uses semi-sync, which bounds but does not eliminate this risk under degraded conditions).

Diagnosis#

kubectl config use-context hetzner
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl moco status -n safetywing-<env>-infra <cluster>

# Replication status on the lagging replica
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin --index <n> <cluster> -- \
  -e "SHOW REPLICA STATUS\G"

# Container logs of the replica (mysqld + agent)
kubectl logs -n safetywing-<env>-infra <replica-pod> -c mysqld --tail=200
kubectl logs -n safetywing-<env>-infra <replica-pod> -c agent --tail=200

# Long-running transactions on the primary that bloat the binlog
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
  -e "SELECT * FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10;"

# Lag per pod (PromQL: run this in Prometheus, not in the shell)
max by (pod) (mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"})

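In the SHOW REPLICA STATUS output, check Replica_IO_Running and Replica_SQL_Running (both should be Yes), Last_Error, and Seconds_Behind_Source. SHOW REPLICA STATUS reports the lag as Seconds_Behind_Source; the Seconds_Behind_Master name appears only in the legacy SHOW SLAVE STATUS output and in the exporter metric name.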
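The check above can be sketched as a small filter that reads the \G output and names the most likely mitigation branch. This is a minimal sketch, not part of MOCO or the runbook tooling: the function name classify_replica_status and its output labels are hypothetical, and it assumes the MySQL 8.0.22+ field names (Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source).

```shell
# Hypothetical helper: classify `SHOW REPLICA STATUS\G` output read from stdin.
# The output labels are illustrative only; they are not produced by MySQL or MOCO.
classify_replica_status() {
  awk -F': ' '
    {
      # \G lines look like "   Field_Name: value"; trim padding around both parts.
      key = $1; sub(/^[ \t]+/, "", key)
      val = $2; sub(/[ \t\r]+$/, "", val)
    }
    key == "Replica_IO_Running"    { io  = val }
    key == "Replica_SQL_Running"   { sql = val }
    key == "Seconds_Behind_Source" { lag = val }
    END {
      if (sql != "Yes")                    print "sql-thread-stopped"
      else if (io != "Yes")                print "io-thread-stalled"
      else if (lag == "NULL" || lag == "") print "lag-unknown"
      else if (lag + 0 > 0)                print "applying-backlog lag=" lag "s"
      else                                 print "healthy"
    }'
}
```

To use it, pipe the replication status command from the Diagnosis section into it, for example by appending `| classify_replica_status` after `-e "SHOW REPLICA STATUS\G"`. The SQL thread is checked first because a stopped applier needs intervention regardless of what the IO thread is doing.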

Mitigation#

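  1. Broken replication (Replica_SQL_Running: No with a non-empty Last_Error): inspect the error first. For a recoverable conflict, MOCO normally re-clones the replica on its own; otherwise force a reinitialization by deleting the replica's pod (and its PVC, if the data itself must be re-cloned) so that it re-bootstraps from the primary. The command below deletes only the pod: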
    kubectl delete pod -n safetywing-<env>-infra <replica-pod>
  2. IO thread stalled / network: confirm the replica can reach the primary; check node/network saturation between Hetzner nodes.
  3. Apply backlog from heavy writes: identify the write source (large batch jobs, migrations) and throttle or reschedule it; lag should drain once write volume drops.
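  4. Long-running transactions on the primary: a transaction is written to the binlog only at commit, so a long one reaches the replica as a single large event group that takes a long time to apply. End the offending transaction (coordinate with the owning service).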
  5. If a replica is persistently unable to catch up or its data is corrupt, remove it and let MOCO re-provision it from the primary.
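Steps 1 and 5 both end in removing a replica so that it is rebuilt from the primary. A rough sketch of that removal follows. The helper name reinit_replica and the DRY_RUN switch are hypothetical, and the PVC name pattern mysql-data-<pod> is an assumption about how MOCO names its data volumes: confirm the real name with `kubectl get pvc -n safetywing-<env>-infra` before running anything destructive.

```shell
# Hypothetical helper: print (default) or run the commands that remove a replica's
# data volume and pod so the replica is rebuilt from the primary.
# ASSUMPTION: the data PVC is named "mysql-data-<pod>"; verify with `kubectl get pvc`.
reinit_replica() {
  ns=$1
  pod=$2
  pvc="mysql-data-${pod}"

  # DRY_RUN=1 (the default) only echoes the commands; set DRY_RUN=0 to execute them.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    run="echo"
  else
    run=""
  fi

  # --wait=false: the PVC stays in Terminating until the pod that mounts it is gone,
  # so do not block on its deletion before deleting the pod.
  $run kubectl delete pvc -n "$ns" "$pvc" --wait=false
  $run kubectl delete pod -n "$ns" "$pod"
}
```

Calling `reinit_replica safetywing-<env>-infra <replica-pod>` prints the two commands for review. Defaulting to a dry run is deliberate, since deleting the PVC discards the replica's local data and should only be done against a replica, never the current primary.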

References#