Meaning#

In KRaft mode exactly one controller node should be active (the metadata quorum leader). This alert means the cluster sees zero active controllers (no quorum leader) or more than one (split brain). Either state puts cluster metadata — topic, partition, ISR, and config state — at risk and blocks administrative operations.

Fires when: the summed active controller count across the namespace is not exactly 1 for 5m. Severity page, tier component.

sum(kafka_controller_kafkacontroller_activecontrollercount{namespace="safetywing-<env>-infra"}) != 1

Impact#

  • No leader for the KRaft metadata quorum → topic creation/deletion, partition reassignment, and config changes fail.
  • Leader elections for data partitions may stall, risking offline/under-replicated partitions.
  • A count > 1 indicates a quorum split and inconsistent metadata views.

Diagnosis#

kubectl config use-context hetzner

# Strimzi CRs and node pools (controller vs broker roles)
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra
kubectl get kafkanodepool -n safetywing-<env>-infra -o custom-columns=\
NAME:.metadata.name,ROLES:.spec.roles,REPLICAS:.spec.replicas

# Controller / dual-role pods
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/cluster -o wide

# Logs: quorum, raft, controller election
kubectl logs -n safetywing-<env>-infra <controller-pod> --tail=300 | grep -iE "quorum|controller|raft|election"

# KRaft metadata quorum state from a node
kubectl exec -n safetywing-<env>-infra <controller-pod> -- \
  bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:9093 describe --status
kubectl exec -n safetywing-<env>-infra <controller-pod> -- \
  bin/kafka-metadata-quorum.sh --bootstrap-controller localhost:9093 describe --replication

Confirm in Prometheus (prom-ep.hetzner.safetywing.dev):

sum(kafka_controller_kafkacontroller_activecontrollercount{namespace="safetywing-<env>-infra"})
kafka_controller_kafkacontroller_activecontrollercount{namespace="safetywing-<env>-infra"}

Mitigation#

  1. Determine whether the count is 0 (no leader) or >1 (split):
    • 0: the controller quorum lacks a majority. Count how many controller-role nodes are Ready; the quorum needs a majority up (e.g. 2 of 3).
    • >1: likely a transient election overlap or network partition between controllers; usually self-resolves once connectivity is stable.
  2. Restore controller pods to restore quorum:
    kubectl describe pod -n safetywing-<env>-infra <controller-pod>
    kubectl delete pod -n safetywing-<env>-infra <controller-pod>   # Strimzi recreates
  3. Check controller PVCs — a corrupt/full __cluster_metadata log keeps a node out of the quorum:
    kubectl get pvc -n safetywing-<env>-infra
  4. Inspect inter-node networking (Talos node health, CNI) if controllers are up but cannot form a quorum.
  5. Watch quorum recovery with kafka-metadata-quorum.sh ... describe --status; the LeaderId should settle on a single node.
  6. Do not delete or reset the metadata log directory — that destroys cluster metadata. Escalate before any KRaft storage recovery action.

References#