KafkaOfflinePartitions • SafetyWing Runbooks

Meaning

One or more partitions have no leader, so they cannot serve reads or writes. Producers and consumers touching an offline partition fail or block, which stalls traffic on the affected topics and puts data at risk.

Fires when: the highest offline partition count reported by any broker stays above zero for 5m. Severity: page. Tier: component.

max(kafka_controller_kafkacontroller_offlinepartitionscount{namespace="safetywing-<env>-infra"}) > 0

Impact

Produce and consume requests to offline partitions fail or hang.
Consumer groups stall on the affected partitions; lag grows.
Topics with offline partitions are partially unavailable: only the partitions that still have a leader keep serving traffic.
Often a symptom of multiple broker failures or unavailable replicas.

Diagnosis

kubectl config use-context hetzner

# Strimzi CRs and broker pods
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/cluster

# Pods not in the Running phase (a Running pod can still be unready; check the READY column)
kubectl get pods -n safetywing-<env>-infra -o wide | grep -v Running

# Broker / controller logs (look for leader election, ISR, disk errors)
kubectl logs -n safetywing-<env>-infra <broker-pod> --tail=200

# Cluster + topic state from inside a broker
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-min-isr-partitions
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions
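
The describe output can be long on clusters with many topics. A minimal sketch of a filter that keeps only leaderless partitions, assuming the output marks them as `Leader: none` (or `Leader: -1` on older Kafka versions); the function name `leaderless_partitions` is hypothetical:

```shell
# Reads `kafka-topics.sh --describe` output on stdin and prints
# "<topic> <partition> <replicas>" for every partition without a leader.
# Assumption: a leaderless partition shows "Leader: none" or "Leader: -1".
leaderless_partitions() {
  awk '
    /Partition: / {
      topic = part = leader = replicas = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Leader:")    leader = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
      }
      if (leader == "none" || leader == "-1") print topic, part, replicas
    }'
}

# Usage (same placeholders as the commands above):
#   kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
#     bin/kafka-topics.sh --bootstrap-server localhost:9092 \
#     --describe --unavailable-partitions | leaderless_partitions
```

The replica list in the last column shows which broker IDs need to come back for each partition.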

Confirm scope in Prometheus (prom-ep.hetzner.safetywing.dev):

max(kafka_controller_kafkacontroller_offlinepartitionscount{namespace="safetywing-<env>-infra"})
kafka_server_replicamanager_underreplicatedpartitions{namespace="safetywing-<env>-infra"}
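
The same queries can be run from a terminal through the Prometheus HTTP API. This is a sketch only: the `https` scheme, the `PROM_URL` variable, and the `prom_query` helper are assumptions, and the endpoint may require authentication that is not shown here.

```shell
# Base URL of the Prometheus endpoint; the https scheme is an assumption.
PROM_URL="${PROM_URL:-https://prom-ep.hetzner.safetywing.dev}"

# Runs an instant query. $1 is the PromQL expression;
# --data-urlencode takes care of braces, quotes and spaces.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   prom_query 'max(kafka_controller_kafkacontroller_offlinepartitionscount{namespace="safetywing-<env>-infra"})'
```

The response is JSON; the current value sits under `data.result`.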

Cross-check leaders/ISR per topic in Provectus kafka-ui (behind Google OIDC).

Mitigation

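Identify down or unhealthy brokers from the pod listing. A partition goes offline when every replica in its ISR is unavailable; with unclean leader election disabled (the Kafka default), an out-of-sync replica cannot take over.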
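
Filtering on the `Running` status alone misses pods that are running but failing readiness. A small sketch that reads the default `kubectl get pods` table and lists pods whose READY count is incomplete or whose status is not `Running`; the helper name `not_ready_pods` is hypothetical:

```shell
# Reads `kubectl get pods` table output on stdin (header on line 1) and prints
# "<name> <ready> <status>" for pods that are not fully ready or not Running.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, r, "/")                       # READY column, e.g. "0/1"
    if (r[1] != r[2] || $3 != "Running") print $1, $2, $3
  }'
}

# Usage:
#   kubectl get pods -n safetywing-<env>-infra -l strimzi.io/cluster | not_ready_pods
```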

Restore brokers first: check pod events for scheduling, PVC, or OOM issues, then restart crash-looping pods:

kubectl describe pod -n safetywing-<env>-infra <broker-pod>
kubectl delete pod -n safetywing-<env>-infra <broker-pod>   # let Strimzi recreate

If a broker is stuck on storage, check the PVC / Rook-Ceph volume; a corrupt or full disk keeps the broker from rejoining the ISR.
```
kubectl get pvc -n safetywing-<env>-infra
```
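
The PVC listing shows binding state but not how full the volume is. A sketch for spotting nearly full mounts from inside the broker, assuming `df` is available in the broker image and that the Kafka data volume is mounted somewhere under `/var/lib/kafka` (both are assumptions); `full_mounts` is a hypothetical helper:

```shell
# Reads `df -P` output on stdin and prints "<mount> <use%>" for every
# filesystem at or above the threshold (first argument, default 90).
full_mounts() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                       # "95%" -> "95"
    if (use + 0 >= limit) print $6, $5
  }'
}

# Usage:
#   kubectl exec -n safetywing-<env>-infra <broker-pod> -- df -P | full_mounts 85
```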

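Once an in-sync replica is back online, the controller elects a new leader automatically. If a partition still has no leader, or leadership stays skewed after brokers rejoin, trigger a preferred-leader election (it only succeeds for partitions whose preferred replica is alive and in the ISR):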

kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type PREFERRED --all-topic-partitions

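If every in-sync replica of a partition is permanently lost and only out-of-sync replicas remain, the only way to bring the partition back is unclean leader election, which accepts the loss of any records the surviving replica never received. Treat it as a last resort, scope it to the affected topic, and confirm with the owning team before enabling it.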
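
One way to keep the blast radius small is to run the election for a single partition. The sketch below only prints the command so it can be reviewed first. It assumes the `kafka-leader-election.sh` in the broker image supports `--election-type UNCLEAN` together with `--topic` and `--partition` (available in recent Kafka releases); `NS`, `POD`, and the function name are hypothetical. Whether the topic config `unclean.leader.election.enable` should be set instead (for example through the Strimzi topic resource) is a decision for the owning team.

```shell
# Prints (does not execute) an unclean election command for one partition.
# $1 = topic, $2 = partition number. NS and POD default to the runbook placeholders.
print_unclean_election_cmd() {
  topic=$1
  partition=$2
  printf '%s\n' "kubectl exec -n ${NS:-safetywing-<env>-infra} ${POD:-<broker-pod>} -- bin/kafka-leader-election.sh --bootstrap-server localhost:9092 --election-type UNCLEAN --topic ${topic} --partition ${partition}"
}

# Usage: print, review with the owning team, then run the printed line by hand.
#   print_unclean_election_cmd <topic> <partition>
```
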
After recovery, verify offlinepartitionscount returns to 0 and under-replicated partitions clear.
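
Recovery can take a few minutes while replicas catch up. A minimal polling sketch, assuming some command that prints the current count on stdout; `wait_for_zero` and `count_unavailable` are hypothetical names:

```shell
# Polls a command until it prints exactly "0".
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
# Returns 0 once the command prints 0, or 1 if attempts run out.
wait_for_zero() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    value=$("$@") || true                   # keep the output even on non-zero exit
    if [ "$value" = "0" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage: count partition lines still reported as unavailable, every 30s for 10 minutes.
#   count_unavailable() {
#     kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
#       bin/kafka-topics.sh --bootstrap-server localhost:9092 \
#       --describe --unavailable-partitions | grep -c 'Partition: '
#   }
#   wait_for_zero 20 30 count_unavailable && echo "no offline partitions"
```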
