KafkaConnectFailedTasks • SafetyWing Runbooks

Meaning

One or more Debezium CDC connector tasks have entered the FAILED state, so change capture for the affected connector is degraded or fully stopped. Fires when:

max(kafka_connect_worker_metrics_connector_failed_task_count{namespace="safetywing-<env>-infra"}) > 0

The condition must hold for 10 minutes (for: 10m). Severity: page; tier: component.

Impact

CDC from MOCO MySQL into Kafka is stalled for the failed connector. Downstream consumers stop receiving database changes: search indices fall behind, derived/mirror tables go stale, and any event-driven flow fed by these topics no longer reflects new writes. Lag grows until the task is recovered.

Diagnosis

List the Connect cluster and connectors, then inspect the failing one.

kubectl config use-context hetzner

kubectl get kafkaconnect,kafkaconnector -n safetywing-<env>-infra

# Identify connectors not in Ready/Running state
kubectl get kafkaconnector -n safetywing-<env>-infra -o yaml | less

# Drill into a specific connector: status.conditions, status.connectorStatus, tasksMax
kubectl get kafkaconnector <name> -n safetywing-<env>-infra -o yaml

Check Connect worker pod logs for the underlying exception (Debezium stack traces, MySQL connection errors, schema/offset issues).

kubectl get pods -n safetywing-<env>-infra -l strimzi.io/kind=KafkaConnect
kubectl logs -n safetywing-<env>-infra <connect-pod> --tail=200
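
To scan every worker at once instead of one pod at a time, the two commands above can be combined into a loop. This is a minimal sketch: the function name `connect_errors` and the `ERROR|Exception|Caused by` pattern are illustrative assumptions rather than part of this runbook, so adjust the pattern to the log format actually in use.

```shell
# Hypothetical helper: print error-looking lines from every Connect worker pod.
# Usage: connect_errors <env> [tail-lines]
# The body runs in a subshell "( ... )" so its variables stay local.
connect_errors() (
  ns="safetywing-$1-infra"
  lines="${2:-200}"
  # "-o name" yields entries such as "pod/<connect-pod>", which kubectl logs accepts
  for pod in $(kubectl get pods -n "$ns" -l strimzi.io/kind=KafkaConnect -o name); do
    echo "== $pod"
    # The grep pattern is an assumption; "|| true" keeps the loop going
    # when a pod has no matching lines.
    kubectl logs -n "$ns" "$pod" --tail="$lines" \
      | grep -E 'ERROR|Exception|Caused by' || true
  done
)
```

For example, `connect_errors <env> 500` widens the window to the last 500 lines per pod.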

Query the Connect REST API from inside a worker pod for per-task error traces.

kubectl exec -n safetywing-<env>-infra <connect-pod> -- \
  curl -s localhost:8083/connectors/<name>/status | jq
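
The status payload lists every task with its `state`, and a failed task also carries a `trace` field holding the stack trace. A jq filter can narrow the output to just those tasks. The sketch below assumes `jq` is installed on the workstation; the function name `failed_tasks` is illustrative.

```shell
# Hypothetical helper: read a /connectors/<name>/status payload on stdin and
# print "<task id><TAB><first line of trace>" for each FAILED task.
#
# Usage (placeholders as elsewhere in this runbook):
#   kubectl exec -n safetywing-<env>-infra <connect-pod> -- \
#     curl -s localhost:8083/connectors/<name>/status | failed_tasks
failed_tasks() {
  jq -r '.tasks[]
         | select(.state == "FAILED")
         | "\(.id)\t\(.trace // "" | split("\n")[0] // "")"'
}
```

The task id in the first column is the value to pass as `<taskId>` when restarting a single task.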

Confirm via Prometheus (prom-ep.hetzner.safetywing.dev) which connector is contributing the failed count.

kafka_connect_worker_metrics_connector_failed_task_count{namespace="safetywing-<env>-infra"} > 0
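
The same expression can also be run from a terminal through the Prometheus HTTP API (`/api/v1/query`). This sketch assumes the host named above is served over HTTPS and is reachable without extra authentication from where the command runs; the runbook confirms neither. `failed_by_series` is an illustrative name, and the helper prints the full label set of each series rather than guessing which label carries the connector name.

```shell
# Hypothetical helper: list series with a non-zero failed task count.
# Usage: failed_by_series <env>
# The body runs in a subshell "( ... )" so its variables stay local.
failed_by_series() (
  query="kafka_connect_worker_metrics_connector_failed_task_count{namespace=\"safetywing-$1-infra\"} > 0"
  # Scheme and absence of auth are assumptions; adjust for the real endpoint.
  curl -sG "https://prom-ep.hetzner.safetywing.dev/api/v1/query" \
    --data-urlencode "query=$query" \
    | jq -c '.data.result[] | {metric: .metric, failed: .value[1]}'
)
```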

Mitigation

From the REST API status, read the failed task’s trace to classify the root cause (transient vs. config/data).

Restart the failed task via the Strimzi annotation (preferred over editing the CR):

kubectl annotate kafkaconnector <name> -n safetywing-<env>-infra \
  strimzi.io/restart-task="<taskId>" --overwrite

To restart the whole connector (all tasks):

kubectl annotate kafkaconnector <name> -n safetywing-<env>-infra \
  strimzi.io/restart="true" --overwrite
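
After either annotation, it helps to confirm that the tasks have actually left FAILED before closing the incident. The loop below is a sketch that polls the task states Strimzi mirrors into `status.connectorStatus` on the KafkaConnector resource. The function name, the 10-second interval, and the default of 30 attempts are arbitrary choices, and the mirrored status may lag the real state until the operator next reconciles.

```shell
# Hypothetical helper: wait until no task of a connector reports FAILED.
# Usage: wait_tasks_ok <env> <name> [attempts] [interval-seconds]
# Returns 0 once no task is FAILED, 1 if the attempts run out.
wait_tasks_ok() (
  ns="safetywing-$1-infra"
  name="$2"
  attempts="${3:-30}"
  interval="${4:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    states=$(kubectl get kafkaconnector "$name" -n "$ns" \
      -o jsonpath='{.status.connectorStatus.tasks[*].state}')
    echo "attempt $((i + 1)): ${states:-<no task status yet>}"
    case "$states" in
      "" | *FAILED*) sleep "$interval" ;;  # still failed, or status not populated yet
      *) return 0 ;;
    esac
    i=$((i + 1))
  done
  return 1
)
```

If the helper times out, re-read the task trace through the REST API rather than restarting again.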

If the failure is a source DB connectivity error, verify that the MOCO MySQL cluster in the same namespace is reachable (pods Ready, service resolving, credentials valid):

```
kubectl get pods,svc -n safetywing-<env>-infra -l app.kubernetes.io/name=moco
```

If task failures stem from saturated workers (OOM/CPU), scale the Connect replicas in the kafka-cdc chart values (connect.replicas) and reconcile via ArgoCD.

Common root causes: MySQL binlog position/offset expired or purged, schema changes Debezium can’t handle, credential rotation, network partition to MySQL, or Kafka broker unavailability. Address the cause before restarting repeatedly: a connector whose binlog position has been purged will fail again on every restart.

References

Strimzi — Connector restart annotations / Using Kafka Connect
Debezium — MySQL connector and fault tolerance / troubleshooting