Meaning#
Fewer Kafka Connect workers are reporting metrics than the number of replicas the kafka-cdc chart expects, indicating that one or more Connect pods are down, crash-looping, or not being scraped. Fires when:
count(kafka_connect_worker_metrics_connector_count{namespace="safetywing-<env>-infra"}) < <connect.replicas>
for: 5m, severity page, tier component.
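As a rough illustration, the comparison the alert expression makes can be reproduced by hand once the reporting-worker count and the expected replica count are known. This is a minimal sketch, not the alert rule itself: the function name workers_missing is hypothetical, both numbers are assumed to have been looked up already (for example with the commands under Diagnosis), and the 5m "for" duration is not modelled.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the alert comparison (reporting < expected).
#   $1 = number of workers currently reporting metrics
#   $2 = expected replicas (connect.replicas in the chart)
# Prints how many workers are missing.
# Exit status 0 means the alert condition holds, 1 means it does not.
workers_missing() {
  reporting=$1
  expected=$2
  if [ "$reporting" -lt "$expected" ]; then
    echo $((expected - reporting))
    return 0
  fi
  echo 0
  return 1
}

# Example: 2 workers reporting, 3 expected.
workers_missing 2 3   # prints 1
```

The exit status lets the helper be used directly in a condition, for example: if workers_missing "$reporting" "$expected" >/dev/null; then ... fi.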
Impact#
Reduced Connect capacity and resilience for the CDC pipeline. Tasks owned by the missing worker are rebalanced onto the surviving workers, which adds load and can reduce throughput and increase lag. If the cluster runs a single replica and that worker is down, CDC from MOCO MySQL into Kafka stops entirely, and downstream consumers (search indices, mirror tables, event flows) stop receiving DB changes.
Diagnosis#
Compare desired vs. running workers and inspect the deployment.
kubectl config use-context hetzner
kubectl get kafkaconnect,kafkaconnector -n safetywing-<env>-infra
# Expected replicas (spec.replicas) vs. ready pods
kubectl get kafkaconnect -n safetywing-<env>-infra -o yaml | grep -i replicas
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/kind=KafkaConnect -o wide
Look at why a pod is not Ready (CrashLoopBackOff, OOMKilled, pending/unschedulable, image pull).
kubectl describe pod -n safetywing-<env>-infra <connect-pod>
kubectl logs -n safetywing-<env>-infra <connect-pod> --previous --tail=200
Check the Strimzi KafkaConnect CR conditions for reconcile errors.
kubectl get kafkaconnector <name> -n safetywing-<env>-infra -o yaml # status.conditions / tasksMax
Confirm the live worker count in Prometheus (prom-ep.hetzner.safetywing.dev).
count(kafka_connect_worker_metrics_connector_count{namespace="safetywing-<env>-infra"})
Mitigation#
- If a pod is OOMKilled or evicted, check node pressure and bump Connect resource requests/limits in the kafka-cdc chart, then reconcile via ArgoCD.
- If pods are Pending/unschedulable, free or add Talos node capacity and verify PVCs/affinity.
- If the worker count is below intended capacity due to a scale-down, scale Connect back up via the chart value connect.replicas (do not kubectl scale directly; Strimzi manages the deployment, so change the CR/chart):
  kubectl get kafkaconnect -n safetywing-<env>-infra -o jsonpath='{.items[0].spec.replicas}{"\n"}'
- For a crash-looping worker, read --previous logs for the failure (bad config, missing secret/credentials, broker connectivity) and fix the root cause; let Strimzi recreate the pod.
- Verify Kafka brokers are healthy; workers that can’t reach the bootstrap servers will fail to start.
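The mapping from pod state to mitigation in the list above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not part of the kafka-cdc chart or Strimzi: the function names mitigation_for and triage_pods are hypothetical, and the input is assumed to be the output of kubectl get pods --no-headers, where the third column is the pod STATUS.

```shell
#!/bin/sh
# Hypothetical helper: maps a pod STATUS value to the matching mitigation step.
mitigation_for() {
  case "$1" in
    OOMKilled|Evicted)
      echo "check node pressure, raise Connect requests/limits in the kafka-cdc chart, reconcile via ArgoCD" ;;
    Pending)
      echo "free or add Talos node capacity, verify PVCs/affinity" ;;
    CrashLoopBackOff)
      echo "read --previous logs, fix the root cause, let Strimzi recreate the pod" ;;
    *)
      echo "inspect with kubectl describe pod and verify Kafka brokers are healthy" ;;
  esac
}

# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints one suggestion per pod
# whose STATUS is not Running. Simplification: a Running pod that is not Ready
# is skipped here and still needs a manual look.
triage_pods() {
  while read -r name ready status rest; do
    if [ -z "$name" ] || [ "$status" = "Running" ]; then
      continue
    fi
    printf '%s (%s): %s\n' "$name" "$status" "$(mitigation_for "$status")"
  done
}

# Usage (left commented out so the sketch has no side effects when sourced):
# kubectl get pods -n safetywing-<env>-infra -l strimzi.io/kind=KafkaConnect --no-headers | triage_pods
```

The output is only a pointer to the relevant step; the actual change still goes through the chart and ArgoCD as described above.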
References#
- Strimzi — Using Kafka Connect / Scaling Connect
- Debezium — Deployment on Strimzi