Meaning

One or more partitions have fewer in-sync replicas (ISR) than their configured replication factor. The cluster is still serving traffic, but durability is reduced: losing one more broker could take partitions offline or lose data. The usual cause is a broker that is down, restarting, or lagging on replication.

Fires when: any broker reports a non-zero under-replicated partition count for 10m. Severity: ticket. Tier: component.

max(kafka_server_replicamanager_underreplicatedpartitions{namespace="safetywing-<env>-infra"}) > 0

Impact

  • Reduced fault tolerance: a single additional broker failure may cause offline partitions or data loss.
  • Producers using acks=all have writes rejected with retriable NotEnoughReplicas errors if the ISR drops below min.insync.replicas, which shows up as slow or stalled producing.
  • Sustained under-replication often precedes an OfflinePartitions page.
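
The bullets above come down to comparing three numbers per partition: replication factor, current ISR size, and min.insync.replicas. A minimal sketch of that comparison follows; the function name and the labels it prints are hypothetical, chosen for illustration, and are not Kafka output.

```shell
# Hypothetical helper: classify one partition from three numbers.
#   $1 = replication factor, $2 = current ISR size, $3 = min.insync.replicas
classify_partition() {
  rf=$1; isr=$2; min_isr=$3
  if [ "$isr" -lt "$min_isr" ]; then
    # acks=all producers are rejected for this partition
    echo "under-min-isr"
  elif [ "$isr" -lt "$rf" ]; then
    # still writable, but this alert fires
    echo "under-replicated"
  else
    echo "healthy"
  fi
}

classify_partition 3 2 2   # one follower out of a 3-replica partition
classify_partition 3 1 2   # a second replica lost
```

With a replication factor of 3 and min.insync.replicas of 2 (values assumed for the example), the first call prints under-replicated and the second prints under-min-isr, which is the point where acks=all producers start failing.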

Diagnosis

kubectl config use-context hetzner

# Strimzi CRs and broker pods
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/cluster -o wide

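# Any pod not Running/Completed? A Running broker can still be unready,
# so also check the READY and RESTARTS columns from the command above.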
kubectl get pods -n safetywing-<env>-infra | grep -vE "Running|Completed"

# Which partitions are under-replicated / below min ISR
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-min-isr-partitions

# Broker logs: replica fetcher, ISR shrink/expand, disk
kubectl logs -n safetywing-<env>-infra <broker-pod> --tail=200 | grep -iE "ISR|replica|fetch|disk"
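
To see which broker IDs are missing from the ISR across many partitions, the describe output can be summarised per broker. The sketch below assumes each partition line carries "Replicas:" and "Isr:" fields followed by comma-separated broker IDs, as kafka-topics.sh normally prints; the function name is hypothetical.

```shell
# Reads `kafka-topics.sh --describe` partition lines on stdin and prints
# "<broker-id> <count>" for each broker that is an assigned replica but is
# absent from the ISR, sorted by broker id.
missing_from_isr() {
  awk '
    /Partition:/ {
      replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      for (j = 1; j <= n; j++) {
        found = 0
        for (k = 1; k <= m; k++) if (r[j] == s[k]) found = 1
        if (!found) missing[r[j]]++
      }
    }
    END { for (b in missing) print b, missing[b] }
  ' | sort -n
}

# Usage (same placeholders as the commands above):
#   kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
#     bin/kafka-topics.sh --bootstrap-server localhost:9092 \
#     --describe --under-replicated-partitions | missing_from_isr
```

A single broker ID with a high count usually points at one unhealthy broker; counts spread across several IDs suggest a wider problem such as storage or network.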

Confirm scope in Prometheus (prom-ep.hetzner.safetywing.dev):

kafka_server_replicamanager_underreplicatedpartitions{namespace="safetywing-<env>-infra"}
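
The same query can be run from a terminal through the Prometheus HTTP API. This is a sketch under assumptions: that the endpoint is reachable over HTTPS from your machine and needs no extra auth headers, neither of which this runbook confirms. The function names and the PROM_URL variable are hypothetical.

```shell
# Build the selector for one environment, e.g. "staging" (placeholder value).
urp_selector() {
  printf 'kafka_server_replicamanager_underreplicatedpartitions{namespace="safetywing-%s-infra"}' "$1"
}

# Assumed base URL; override PROM_URL if the endpoint differs or sits behind a proxy.
PROM_URL="${PROM_URL:-https://prom-ep.hetzner.safetywing.dev}"

# Instant query via /api/v1/query; prints the raw JSON response.
urp_query() {
  curl -sG "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$(urp_selector "$1")"
}

# Usage:
#   urp_query <env>
```

Each series in the response corresponds to one reporting broker, so non-zero values show which brokers lead under-replicated partitions.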

Check per-topic ISR in Provectus kafka-ui (behind Google OIDC).

Mitigation

  1. Find the broker(s) that left the ISR — almost always a pod that is down, restarting, or under resource pressure.
    kubectl describe pod -n safetywing-<env>-infra <broker-pod>
  2. If a broker is crash-looping or being OOM-killed, fix the cause (memory limits, disk) and let it rejoin. Strimzi recreates deleted pods, so deleting the pod forces a clean restart:
    kubectl delete pod -n safetywing-<env>-infra <broker-pod>
  3. If it is a slow/lagging follower (not down), check disk I/O and storage health (PVC / Rook-Ceph) — replication may simply be catching up after a restart.
    kubectl get pvc -n safetywing-<env>-infra
  4. Watch the ISR re-expand; kafka_server_replicamanager_underreplicatedpartitions should trend to 0 once followers catch up.
  5. If under-replication persists with all brokers Ready, inspect replica fetcher errors in the logs and consider a partition reassignment to rebalance replicas. The command below only prints the tool's options:
    kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
      bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --help
  6. Avoid lowering min.insync.replicas to quiet related alerts or unblock producers; it does not restore the missing replicas and trades durability for green dashboards.
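
Step 4 can be scripted as a bounded polling loop rather than watched by hand. The sketch below is illustrative only: both function names are hypothetical, and count_urp assumes that every under-replicated partition appears as one line containing "Partition:" in the describe output.

```shell
# Poll a command that prints the current under-replicated partition count
# until it prints 0 or the attempts run out.
#   $1 = max attempts, $2 = seconds between attempts, rest = command to run
wait_until_zero() {
  attempts=$1; interval=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    count=$("$@")
    echo "attempt $i: under-replicated partitions = $count"
    if [ "$count" = "0" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical counter: $1 = namespace, $2 = broker pod.
# `|| true` keeps the exit status clean when grep finds no lines and prints 0.
count_urp() {
  kubectl exec -n "$1" "$2" -- \
    bin/kafka-topics.sh --bootstrap-server localhost:9092 \
    --describe --under-replicated-partitions | grep -c 'Partition:' || true
}

# Usage: up to 30 checks, one minute apart.
#   wait_until_zero 30 60 count_urp safetywing-<env>-infra <broker-pod>
```

The loop returns non-zero if the count never reaches 0, which is the cue to move on to step 5 instead of waiting longer.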

References