Meaning#
One or more Debezium CDC connector tasks have entered the FAILED state, so change capture for the affected connector is degraded or fully stopped. Fires when:
max(kafka_connect_worker_metrics_connector_failed_task_count{namespace="safetywing-<env>-infra"}) > 0
for: 10m, severity: page, tier: component.
Impact#
CDC from MOCO MySQL into Kafka is stalled for the failed connector. Downstream consumers stop receiving database changes: search indices fall behind, derived/mirror tables go stale, and any event-driven flow fed by these topics no longer reflects new writes. Lag grows until the task is recovered.
Diagnosis#
List the Connect cluster and connectors, then inspect the failing one.
kubectl config use-context hetzner
kubectl get kafkaconnect,kafkaconnector -n safetywing-<env>-infra
# Identify connectors not in Ready/Running state
kubectl get kafkaconnector -n safetywing-<env>-infra -o yaml | less
# Drill into a specific connector: status.conditions, status.connectorStatus, tasksMax
kubectl get kafkaconnector <name> -n safetywing-<env>-infra -o yaml
Check Connect worker pod logs for the underlying exception (Debezium stack traces, MySQL connection errors, schema/offset issues).
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/kind=KafkaConnect
kubectl logs -n safetywing-<env>-infra <connect-pod> --tail=200
Query the Connect REST API from inside a worker pod for per-task error traces.
kubectl exec -n safetywing-<env>-infra <connect-pod> -- \
  curl -s localhost:8083/connectors/<name>/status | jq
Confirm via Prometheus (prom-ep.hetzner.safetywing.dev) which connector is contributing the failed count.
kafka_connect_worker_metrics_connector_failed_task_count{namespace="safetywing-<env>-infra"} > 0
Mitigation#
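The first step below depends on the per-task trace from the REST status call. As a minimal sketch, assuming jq is installed locally and the response follows the standard Kafka Connect status shape (a tasks array with id, state, and trace), the failed task IDs and the first line of each trace can be pulled out like this. The helper name failed_tasks is hypothetical and not part of any existing tooling.

```shell
# Sketch only: failed_tasks is a hypothetical helper, not existing tooling.
# Reads the JSON from GET /connectors/<name>/status on stdin and prints
# one line per FAILED task: "<taskId><TAB><first line of trace>".
failed_tasks() {
  jq -r '
    .tasks[]
    | select(.state == "FAILED")
    | [.id, ((.trace // "") | split("\n")[0])]
    | @tsv
  '
}

# Assumed usage, piping the status call from the Diagnosis section:
#   kubectl exec -n safetywing-<env>-infra <connect-pod> -- \
#     curl -s localhost:8083/connectors/<name>/status | failed_tasks
```

The first line of the trace is usually the exception class and message, which is often enough to tell a transient failure from a config or data problem. Read the full trace when it is not.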
From the REST API status, read the failed task’s trace field to classify the root cause (transient vs. config/data).
Restart the failed task via the Strimzi annotation (preferred over editing the CR):
kubectl annotate kafkaconnector <name> -n safetywing-<env>-infra \
  strimzi.io/restart-task="<taskId>" --overwrite
To restart the whole connector (all tasks):
kubectl annotate kafkaconnector <name> -n safetywing-<env>-infra \
  strimzi.io/restart="true" --overwrite
If the failure is a source DB connectivity error, verify the MOCO MySQL cluster in the same namespace is reachable (pods Ready, service resolving, credentials valid):
kubectl get pods,svc -n safetywing-<env>-infra -l app.kubernetes.io/name=moco
If workers are saturated (OOM/CPU) and this is causing task failures, scale Connect replicas in the kafka-cdc chart values (connect.replicas) and reconcile via ArgoCD.
Common root causes: MySQL binlog position/offset expired or purged, schema changes Debezium can’t handle, credential rotation, network partition to MySQL, or Kafka broker unavailability. Address the cause before repeated restarts: a connector restarted against a purged binlog will fail again.
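When several tasks have failed, it can help to generate the per-task restart commands for review rather than typing each one. The following is a minimal sketch, assuming jq is available locally and the status JSON has the standard Kafka Connect shape. The helper name restart_commands is hypothetical. It only prints commands and never runs them, so the root cause can be checked first.

```shell
# Sketch only: restart_commands is a hypothetical helper, not existing tooling.
# Usage: restart_commands <connector> <namespace> < status.json
# Reads the connector status JSON on stdin and prints one
# "kubectl annotate ... strimzi.io/restart-task" command per FAILED task.
# Nothing is executed; the operator reviews and runs the output manually.
restart_commands() {
  jq -r '.tasks[] | select(.state == "FAILED") | .id' |
  while read -r task_id; do
    printf 'kubectl annotate kafkaconnector %s -n %s strimzi.io/restart-task="%s" --overwrite\n' \
      "$1" "$2" "$task_id"
  done
}

# Assumed usage, with the status call from the Diagnosis section:
#   kubectl exec -n safetywing-<env>-infra <connect-pod> -- \
#     curl -s localhost:8083/connectors/<name>/status \
#     | restart_commands <name> safetywing-<env>-infra
```

Printing rather than executing is deliberate: blind restarts against a purged binlog or rotated credentials only produce the same failure again.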
References#
- Strimzi — Connector restart annotations / Using Kafka Connect
- Debezium — MySQL connector and fault tolerance / troubleshooting