Meaning#
Partitions have fewer in-sync replicas than their configured replication factor. The cluster is still serving traffic, but durability is reduced — losing one more broker could take partitions offline or lose data. Usually a broker is down, restarting, or lagging behind on replication.
Fires when: any broker reports a non-zero under-replicated partition count for 10m. Severity ticket, tier component.
max(kafka_server_replicamanager_underreplicatedpartitions{namespace="safetywing-<env>-infra"}) > 0
Impact#
- Reduced fault tolerance: a single additional broker failure may cause offline partitions or data loss.
- Producers using acks=all may slow down or block if the ISR drops below min.insync.replicas.
- Sustained under-replication often precedes an OfflinePartitions page.
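The relationship between replication factor, ISR size, and min.insync.replicas behind the impact points above can be sketched as a small shell function. This is a simplified model, not Kafka's own logic: the name classify_partition and its three labels are illustrative assumptions, and leaderless (offline) partitions are deliberately not modeled.

```shell
#!/usr/bin/env bash
# classify_partition RF ISR_SIZE MIN_ISR
# Hypothetical helper: prints a coarse health label for one partition.
classify_partition() {
  local rf=$1 isr=$2 min_isr=$3
  if [ "$isr" -lt "$min_isr" ]; then
    # Below min.insync.replicas: acks=all writes are rejected and producers retry.
    echo "under-min-isr"
  elif [ "$isr" -lt "$rf" ]; then
    # Still writable with acks=all, but with less redundancy than configured.
    echo "under-replicated"
  else
    echo "healthy"
  fi
}

classify_partition 3 3 2   # healthy
classify_partition 3 2 2   # under-replicated
classify_partition 3 1 2   # under-min-isr
```

With a replication factor of 3 and min.insync.replicas of 2, the first lost replica only raises this alert; the second one starts failing acks=all writes.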
Diagnosis#
kubectl config use-context hetzner
# Strimzi CRs and broker pods
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/cluster -o wide
# Any broker not Ready / restarting?
kubectl get pods -n safetywing-<env>-infra | grep -vE "Running|Completed"
# Which partitions are under-replicated / below min ISR
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --under-replicated-partitions
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--describe --under-min-isr-partitions
# Broker logs: replica fetcher, ISR shrink/expand, disk
kubectl logs -n safetywing-<env>-infra <broker-pod> --tail=200 | grep -iE "ISR|replica|fetch"
Confirm scope in Prometheus (prom-ep.hetzner.safetywing.dev):
kafka_server_replicamanager_underreplicatedpartitions{namespace="safetywing-<env>-infra"}
Check per-topic ISR in Provectus kafka-ui (behind Google OIDC).
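To see at a glance which broker is behind, the --describe output from the commands above can be tallied per broker ID. The sketch below assumes the usual partition line layout (Topic: … Partition: … Leader: … Replicas: … Isr: …); the function name missing_from_isr and the sample topics are made up for illustration.

```shell
#!/usr/bin/env bash
# missing_from_isr: read `kafka-topics.sh --describe` partition lines on stdin and
# print "<broker-id> <partition-count>" for each replica that is absent from the ISR.
missing_from_isr() {
  awk '
    {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr      = $(i + 1)
      }
      if (replicas == "") next          # not a partition line
      n = split(replicas, r, ",")
      for (j = 1; j <= n; j++) {
        # Wrap in commas so broker 1 does not match broker 10.
        if (index("," isr ",", "," r[j] ",") == 0) missing[r[j]]++
      }
    }
    END { for (b in missing) print b, missing[b] }
  ' | sort -n
}

# Illustrative sample input (not real cluster output):
printf '%s\n' \
  "Topic: orders Partition: 0 Leader: 0 Replicas: 0,1,2 Isr: 0,1" \
  "Topic: orders Partition: 1 Leader: 1 Replicas: 1,2,0 Isr: 1,0" \
  "Topic: events Partition: 0 Leader: 0 Replicas: 2,0,1 Isr: 0,1" \
  | missing_from_isr

# Against the cluster, pipe the Diagnosis command into it:
#   kubectl exec -n "safetywing-<env>-infra" "<broker-pod>" -- \
#     bin/kafka-topics.sh --bootstrap-server localhost:9092 \
#     --describe --under-replicated-partitions | missing_from_isr
```

If one broker ID accounts for nearly every line, that broker is the one to investigate first; a spread across several IDs points at a shared cause such as storage or network.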
Mitigation#
- Find the broker(s) that left the ISR. This is almost always a pod that is down, restarting, or under resource pressure.
  kubectl describe pod -n safetywing-<env>-infra <broker-pod>
- If a broker is crash-looping or OOM, address the cause (memory limits, disk) and let it rejoin; Strimzi recreates deleted pods:
  kubectl delete pod -n safetywing-<env>-infra <broker-pod>
- If it is a slow/lagging follower (not down), check disk I/O and storage health (PVC / Rook-Ceph). Replication may simply be catching up after a restart.
  kubectl get pvc -n safetywing-<env>-infra
- Watch the ISR re-expand; underreplicatedpartitions should trend to 0 once followers catch up.
- If under-replication persists with all brokers Ready, inspect replica fetcher errors in logs and consider a partition reassignment to rebalance:
  kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --help
- Avoid lowering min.insync.replicas to silence the alert: that trades durability for green dashboards.
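Watching the ISR re-expand can be scripted as a bounded polling loop rather than re-running the describe command by hand. The following is a minimal sketch: wait_for_zero and count_urp are hypothetical names, and the attempt count and interval are arbitrary values to adjust.

```shell
#!/usr/bin/env bash
# wait_for_zero MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it prints 0. Returns 0 on success,
# 1 if the count never reached 0 within MAX_ATTEMPTS.
wait_for_zero() {
  local max_attempts=$1 interval=$2
  shift 2
  local attempt count
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    count=$("$@")
    echo "attempt ${attempt}/${max_attempts}: under-replicated=${count}" >&2
    # Numeric compare; non-numeric output is treated as "not zero yet".
    if [ "$count" -eq 0 ] 2>/dev/null; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
  done
  return 1
}

# Hypothetical counter built from the Diagnosis command (one output line per partition):
#   count_urp() {
#     kubectl exec -n "safetywing-<env>-infra" "<broker-pod>" -- \
#       bin/kafka-topics.sh --bootstrap-server localhost:9092 \
#       --describe --under-replicated-partitions | wc -l | tr -d ' '
#   }
#   wait_for_zero 30 20 count_urp || echo "still under-replicated after 10 minutes"
```

A count that stalls above 0 while all brokers are Ready is the cue to move on to the replica fetcher logs and the reassignment step.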