KafkaConnectWorkersDown • SafetyWing Runbooks

Meaning

Fewer Kafka Connect workers are reporting metrics than the number of replicas the kafka-cdc chart expects. This indicates that one or more Connect pods are down, crash-looping, or not being scraped. Fires when:

count(kafka_connect_worker_metrics_connector_count{namespace="safetywing-<env>-infra"}) < <connect.replicas>

for: 5m, severity page, tier component.

Impact

Reduced Connect capacity and resilience for the CDC pipeline. Tasks owned by the missing worker are rebalanced onto the surviving workers, which adds load and can reduce throughput and increase lag. If the cluster runs a single replica and that worker is down, CDC from MOCO MySQL into Kafka stops entirely and downstream consumers (search indices, mirror tables, event flows) stop receiving DB changes.

Diagnosis

Compare desired vs. running workers and inspect the deployment.

kubectl config use-context hetzner

kubectl get kafkaconnect,kafkaconnector -n safetywing-<env>-infra

# Expected replicas (spec.replicas) vs. ready pods
kubectl get kafkaconnect -n safetywing-<env>-infra -o yaml | grep -i replicas
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/kind=KafkaConnect -o wide
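
The desired-versus-ready comparison above can be scripted. The following is a minimal sketch, not existing tooling: the helper names count_ready_pods and connect_shortfall are hypothetical, and it assumes the first KafkaConnect resource in the namespace is the one of interest.

```shell
# Sketch only: helper names are illustrative, not part of any existing tooling.

# Count pods that are Running with all containers ready.
# Reads `kubectl get pods --no-headers` output on stdin
# (column 2 is READY as "ready/total", column 3 is STATUS).
count_ready_pods() {
  awk '{ split($2, r, "/"); if ($3 == "Running" && r[1] == r[2]) n++ } END { print n + 0 }'
}

# Print how many workers are missing (0 if none).
# $1 = desired replicas, $2 = ready pods
connect_shortfall() {
  if [ "$2" -lt "$1" ]; then
    echo $(( $1 - $2 ))
  else
    echo 0
  fi
}

# Usage against the cluster (replace <env> first):
#   ns=safetywing-<env>-infra
#   desired=$(kubectl get kafkaconnect -n "$ns" -o jsonpath='{.items[0].spec.replicas}')
#   ready=$(kubectl get pods -n "$ns" -l strimzi.io/kind=KafkaConnect --no-headers | count_ready_pods)
#   echo "missing workers: $(connect_shortfall "$desired" "$ready")"
```

A non-zero result corresponds to the situation this alert reports, although the alert itself counts workers exposing metrics rather than ready pods, so the two numbers can differ if a pod is up but not being scraped.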

Look at why a pod is not Ready (CrashLoopBackOff, OOMKilled, pending/unschedulable, image pull).

kubectl describe pod -n safetywing-<env>-infra <connect-pod>
kubectl logs -n safetywing-<env>-infra <connect-pod> --previous --tail=200
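
Once the failure reason is known, it maps fairly directly onto the steps in the Mitigation section. A small sketch of that mapping follows; the function name next_step_for_reason is hypothetical, and the image-pull hint is an assumption, since this runbook lists no mitigation step for it.

```shell
# Sketch only: maps a Kubernetes failure reason to the matching Mitigation step.
# $1 = reason string as reported by Kubernetes (e.g. OOMKilled)
next_step_for_reason() {
  case "$1" in
    OOMKilled|Evicted)
      echo "check node pressure; raise Connect resource requests/limits in the kafka-cdc chart" ;;
    Unschedulable)
      echo "free or add Talos node capacity; verify PVCs/affinity" ;;
    CrashLoopBackOff|Error)
      echo "read --previous logs; fix config, secrets, or broker connectivity" ;;
    ImagePullBackOff|ErrImagePull)
      # Assumption: not covered by the Mitigation section of this runbook.
      echo "check the image reference and pull secrets" ;;
    *)
      echo "unrecognized reason '$1'; inspect the kubectl describe output" ;;
  esac
}

# Usage (replace the placeholders first):
#   pod=<connect-pod>; ns=safetywing-<env>-infra
#   waiting=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
#   last=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
#   # Prefer the last termination reason (e.g. OOMKilled) over the waiting reason.
#   next_step_for_reason "${last:-$waiting}"
```

Evicted and Unschedulable are reported at pod level (status.reason and the PodScheduled condition) rather than in containerStatuses, so for those cases the reason has to be read from the describe output.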

Check the Strimzi CR conditions for reconcile errors. The command below inspects a KafkaConnector; the KafkaConnect CR exposes status.conditions in the same way.

kubectl get kafkaconnector <name> -n safetywing-<env>-infra -o yaml   # status.conditions / tasksMax
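
Reading the full YAML works, but the conditions can also be flattened to one line each and filtered. The sketch below assumes Strimzi reports a healthy resource as a condition of type Ready with status True; the helper name problem_conditions is hypothetical.

```shell
# Sketch only: keep every condition except a healthy "Ready=True".
# Reads tab-separated lines on stdin: type, status, reason, message.
problem_conditions() {
  awk -F '\t' 'NF && !($1 == "Ready" && $2 == "True")'
}

# Usage (replace the placeholders first):
#   ns=safetywing-<env>-infra
#   kubectl get kafkaconnector <name> -n "$ns" \
#     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}' \
#     | problem_conditions
```

Empty output means no condition other than Ready=True was reported; any printed line carries the reason and message to follow up on.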

Confirm the live worker count in Prometheus (prom-ep.hetzner.safetywing.dev).

count(kafka_connect_worker_metrics_connector_count{namespace="safetywing-<env>-infra"})
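
The same query can be run from a terminal through the Prometheus HTTP API. This is a sketch under stated assumptions: that the endpoint is reachable over HTTPS without authentication and that curl and jq are installed. The helper names worker_count_query and workers_below_expected are hypothetical. Note that count() over an empty vector returns no series at all, so the script treats an empty result as zero workers.

```shell
# Sketch only: helper names are illustrative.

# Build the PromQL from the alert expression for one namespace.
# $1 = namespace
worker_count_query() {
  printf 'count(kafka_connect_worker_metrics_connector_count{namespace="%s"})\n' "$1"
}

# Succeeds (exit 0) when the alert condition holds: live < expected.
# $1 = live worker count (empty is treated as 0), $2 = expected replicas
workers_below_expected() {
  [ "${1:-0}" -lt "$2" ]
}

# Usage (assumes HTTPS, no auth, and jq; replace the placeholders first):
#   ns=safetywing-<env>-infra
#   live=$(curl -sG 'https://prom-ep.hetzner.safetywing.dev/api/v1/query' \
#            --data-urlencode "query=$(worker_count_query "$ns")" \
#          | jq -r '.data.result[0].value[1] // empty')
#   workers_below_expected "$live" <connect.replicas> && echo "alert condition holds"
```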

Mitigation

If a pod is OOMKilled or evicted, check node pressure and bump Connect resource requests/limits in the kafka-cdc chart, then reconcile via ArgoCD.
If pods are Pending/unschedulable, free or add Talos node capacity and verify PVCs/affinity.
If the worker count is below intended capacity due to a scale-down, scale Connect back up via the chart value connect.replicas. Do not use kubectl scale directly: Strimzi manages the deployment, so change the CR/chart instead. To confirm the replica count currently set on the CR:
```
kubectl get kafkaconnect -n safetywing-<env>-infra -o jsonpath='{.items[0].spec.replicas}{"\n"}'
```
For a crash-looping worker, read --previous logs for the failure (bad config, missing secret/credentials, broker connectivity) and fix the root cause; let Strimzi recreate the pod.
Verify Kafka brokers are healthy — workers that can’t reach the bootstrap servers will fail to start.

References

Strimzi — Using Kafka Connect / Scaling Connect
Debezium — Deployment on Strimzi