KafkaConsumerGroupLagHigh • SafetyWing Runbooks

Meaning

A consumer group is falling behind the producers on a topic: its lag, the gap between the latest offset (log end offset) and the group’s committed offset, has exceeded the threshold. Messages are arriving faster than the group consumes them, or the group has stopped consuming, so processing is delayed. The consumergroup and topic labels identify which group and topic are affected.

Fires when: per-(consumergroup, topic) lag exceeds <threshold> for 15m. Severity: ticket. Tier: component.

max by (consumergroup, topic) (kafka_consumergroup_lag{namespace="safetywing-<env>-infra"}) > <threshold>

Impact

Delayed processing for the affected consumer group → stale downstream data, late side effects, growing end-to-end latency.
If lag keeps climbing, retention may delete messages before they are consumed, causing permanent message loss.
Brokers retain more unconsumed data, increasing disk usage.

Diagnosis

kubectl config use-context hetzner
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra

# Describe the lagging group: per-partition LAG, CURRENT-OFFSET, LOG-END-OFFSET, CONSUMER-ID
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group <consumergroup>

# Are there active members, or is the group empty / rebalancing?
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group <consumergroup> --members --verbose

# Inspect the consuming application workload
kubectl get pods -A | grep <consumer-app>
kubectl logs -n <consumer-ns> <consumer-pod> --tail=200
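
For a group that spans many partitions, the per-partition table from --describe can be long. The sketch below totals the LAG column per topic. It assumes the output has a header row naming the TOPIC and LAG columns, as recent Kafka versions print; sum_lag_by_topic is a hypothetical helper name, not part of the Kafka tooling.

```shell
# Hypothetical helper: sum the LAG column per topic from
# `kafka-consumer-groups.sh --describe` output read on stdin.
# Columns are located by header name, so column order does not matter.
sum_lag_by_topic() {
  awk '
    # Header row: remember which columns hold TOPIC and LAG.
    $0 ~ /LOG-END-OFFSET/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") t = i
        if ($i == "LAG") l = i
      }
      next
    }
    # Data rows: skip rows whose LAG is "-" (no committed offset yet).
    t && l && $l ~ /^[0-9]+$/ { total[$t] += $l }
    END { for (topic in total) print topic, total[topic] }
  ' | sort
}
```

Pipe the --describe --group <consumergroup> command above into sum_lag_by_topic to get one "topic total-lag" line per topic.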

Confirm trend and scope in Prometheus (prom-ep.hetzner.safetywing.dev):

max by (consumergroup, topic) (kafka_consumergroup_lag{namespace="safetywing-<env>-infra"})

# Is it growing or draining?
deriv(max by (consumergroup, topic) (kafka_consumergroup_lag{namespace="safetywing-<env>-infra"})[15m:])
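
The same queries can be run from a terminal through the Prometheus HTTP API. This is a minimal sketch: it assumes the endpoint is served over HTTPS at the standard /api/v1/query path and is reachable without extra authentication from where you run it. lag_query and prom_query are hypothetical helper names.

```shell
# Assumption: HTTPS and the standard Prometheus API path; override PROM_URL if not.
PROM_URL="${PROM_URL:-https://prom-ep.hetzner.safetywing.dev}"

# Build the lag query for one environment (fills the <env> placeholder).
lag_query() {
  printf 'max by (consumergroup, topic) (kafka_consumergroup_lag{namespace="safetywing-%s-infra"})' "$1"
}

# Run an instant query; --data-urlencode escapes the braces and quotes.
prom_query() {
  curl -fsS --get --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}
```

Calling prom_query with the output of lag_query for the affected environment returns the current lag per (consumergroup, topic) as JSON.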

The Grafana Kafka dashboards (grafana.safetywing.dev) and Provectus kafka-ui (behind Google OIDC) both show per-group lag.

Mitigation

Check whether the consumer is healthy first — most lag is a stuck/slow consumer, not a Kafka problem:
- Pods crash-looping, OOM, or stuck rebalancing?
- --describe --members showing no active members → consumer is down.
```
kubectl describe pod -n <consumer-ns> <consumer-pod>
```
Restart or fix the consumer application if it is wedged on a poison message, deadlock, or downstream dependency.
If the consumer is healthy but simply outpaced, scale it out — add replicas/instances up to the partition count of the topic (parallelism is bounded by partitions).
If a single partition is hot, check the producer’s partitioning key for skew.
If lag is from a transient producer spike, confirm lag is draining (negative deriv) and let it catch up — no action beyond monitoring.
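
Two of the decisions above are simple arithmetic, sketched here. The first helper caps a scale-out target at the partition count, assuming one consumer instance per replica and all replicas in the same consumer group. The second estimates how long a draining backlog needs to clear, assuming the drain rate from deriv() stays roughly constant. Both function names are hypothetical.

```shell
# Hypothetical helpers; neither is part of kubectl or the Kafka tooling.

# Replicas beyond the partition count sit idle, so cap the target:
# min(desired, partitions).
useful_replicas() {
  desired=$1
  partitions=$2
  if [ "$desired" -gt "$partitions" ]; then
    echo "$partitions"
  else
    echo "$desired"
  fi
}

# Minutes until lag reaches zero, given the current lag and the deriv()
# value (messages per second; negative while draining).
drain_eta_minutes() {
  awk -v lag="$1" -v rate="$2" 'BEGIN {
    if (rate >= 0) { print "not draining"; exit }
    printf "%.1f\n", lag / (-rate) / 60
  }'
}
```

For example, useful_replicas 10 6 prints 6, and drain_eta_minutes 90000 -50 prints 30.0 (90,000 messages draining at 50 per second).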

Verify the topic’s retention is long enough that backlog is not expiring before consumption:

kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
  bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --topic <topic> --all | grep retention
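
retention.ms is reported in milliseconds, which is awkward to compare against a backlog measured in hours. A small conversion sketch follows; retention_ms_to_hours is a hypothetical helper, and it covers time-based retention only (size-based retention via retention.bytes is a separate limit).

```shell
# Hypothetical helper: convert a retention.ms value to hours.
# A value of -1 means no time limit.
retention_ms_to_hours() {
  awk -v ms="$1" 'BEGIN {
    if (ms < 0) { print "unlimited"; exit }
    printf "%.1f\n", ms / 3600000
  }'
}
```

For example, retention_ms_to_hours 604800000 prints 168.0, which is 7 days.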

After recovery, confirm lag falls back below <threshold>.
