SafetyWing Runbooks

Alert Catalog

Mon, 01 Jan 0001 00:00:00 +0000

Alert Catalog#

Every alert evaluated across SafetyWing clusters — 29 custom (component / environment / platform tiers, owned by us) and 133 stock (kube-prometheus-stack defaults). Custom alerts link to the runbook on this site; stock alerts link to the upstream prometheus-operator runbooks.

Generated from the live hetzner rule set and the infra-charts/cluster-monitors sources. Stock alerts are identical across clusters; custom alerts deploy per environment/cluster where the chart is enabled.

SafetyWing custom alerts#

Kafka (component tier)#

Alert	Severity	Runbook
KafkaOfflinePartitions	`page`	runbook
KafkaNoActiveController	`page`	runbook
KafkaUnderReplicatedPartitions	`ticket`	runbook
KafkaConsumerGroupLagHigh	`ticket`	runbook

Kafka Connect (component tier)#

Alert	Severity	Runbook
KafkaConnectFailedTasks	`page`	runbook
KafkaConnectWorkersDown	`page`	runbook
KafkaConnectNoConnectors	`ticket`	runbook

MySQL (component tier)#

Alert	Severity	Runbook
MysqlInstanceDown	`page`	runbook
MysqlConnectionsSaturated	`ticket`	runbook
MysqlReplicationLagHigh	`ticket`	runbook
MysqlDiskFillingUp	`ticket`	runbook

RabbitMQ (component tier)#

Alert	Severity	Runbook
RabbitmqNodeDown	`page`	runbook
RabbitmqMemoryAlarm	`page`	runbook
RabbitmqDiskAlarm	`page`	runbook
RabbitmqQueueBacklog	`ticket`	runbook
RabbitmqQueueNoConsumers	`ticket`	runbook

Ceph (platform tier)#

Alert	Severity	Runbook
CephHealthError	`page`	runbook
CephMonOutOfQuorum	`page`	runbook
CephHealthWarning	`ticket`	runbook
CephOSDDown	`ticket`	runbook
CephClusterNearFull	`ticket`	runbook

Elasticsearch (platform tier)#

Alert	Severity	Runbook
ElasticsearchClusterRed	`page`	runbook
ElasticsearchClusterYellow	`ticket`	runbook
ElasticsearchHeapHigh	`ticket`	runbook
ElasticsearchDiskWatermark	`ticket`	runbook

Node (platform tier)#

Alert	Severity	Runbook
NodeFilesystemAlmostFull	`ticket`	runbook

Traefik (platform tier)#

Alert	Severity	Runbook
TraefikDown	`page`	runbook
TraefikHigh5xxRate	`ticket`	runbook

Environment (environment tier)#

Alert	Severity	Runbook
EnvironmentHigh5xxRate	`ticket`	runbook

Stock alerts (kube-prometheus-stack)#

Shipped by the kube-prometheus-stack defaultRules and documented upstream; the links go to the upstream prometheus-operator runbooks.

CephClusterNearFull

Meaning#

Raw capacity usage of the Rook-Ceph cluster has crossed the configured threshold. As Ceph approaches full it first throttles, then refuses writes — so this alert is a capacity early-warning that needs action before it becomes an outage.

Fires when:

ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > <ratio>

for: 15m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.

Impact#

When Ceph hits its nearfull/backfillfull/full ratios it degrades to HEALTH_WARN then HEALTH_ERR, and at the full ratio it blocks writes and can force volumes read-only. Because Ceph backs PVCs across all environments on hetzner, a full cluster is a multi-environment storage outage.
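The progression through the three ratios can be sketched as a small classifier. This is an illustration only, not part of the alert rule: the 0.85 / 0.90 / 0.95 thresholds are Ceph's upstream defaults and are assumed here (the cluster may override them), and Ceph evaluates them per OSD, whereas the sketch applies them to a single usage figure.

```python
# Ceph's upstream default ratios (an assumption; the live cluster may override them).
NEARFULL_RATIO = 0.85
BACKFILLFULL_RATIO = 0.90
FULL_RATIO = 0.95


def ceph_capacity_state(used_bytes: int, total_bytes: int) -> str:
    """Classify a usage figure against the nearfull/backfillfull/full ratios."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= FULL_RATIO:
        return "full"          # writes are blocked
    if ratio >= BACKFILLFULL_RATIO:
        return "backfillfull"  # backfill onto the affected OSDs is refused
    if ratio >= NEARFULL_RATIO:
        return "nearfull"      # early warning, HEALTH_WARN
    return "ok"


if __name__ == "__main__":
    tib = 1024 ** 4
    for used in (70, 87, 92, 96):
        print(used, ceph_capacity_state(used * tib, 100 * tib))
```

The gap between "nearfull" and "full" is only about ten percentage points, which is why the alert is meant to be acted on while it is still a ticket.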

CephHealthError

Meaning#

The Rook-Ceph cluster is reporting HEALTH_ERR — Ceph has detected one or more error-level conditions and storage is at risk. This is the most severe Ceph health state.

Fires when:

ceph_health_status == 2

for: 5m, severity page, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.

Impact#

Ceph backs the PVCs for workloads across all environments on the hetzner cluster. In HEALTH_ERR, IO may stall, volumes can be forced read-only, and writes can be blocked. Treat as an active or imminent storage outage affecting every environment that depends on Ceph-backed storage.

CephHealthWarning

Meaning#

The Rook-Ceph cluster has been in HEALTH_WARN for a sustained period. Ceph is functional but degraded — something needs attention before it escalates to HEALTH_ERR.

Fires when:

ceph_health_status == 1

for: 30m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.

Impact#

Usually no immediate outage — IO continues. But HEALTH_WARN indicates reduced redundancy or headroom (degraded PGs, an OSD nearing full, a flapping mon, etc.) that affects storage backing PVCs across all environments. Left unaddressed it can progress to HEALTH_ERR and read-only/blocked writes.
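The two Ceph health alerts key off the same gauge, so their relationship can be written down as a lookup. The sketch below only restates the values given in these runbooks (1 for HEALTH_WARN, 2 for HEALTH_ERR, with their for durations and severities); treating 0 as HEALTH_OK is an assumption based on the usual exporter convention.

```python
# Values taken from the CephHealthWarning and CephHealthError rules.
CEPH_HEALTH_RULES = {
    1: {"state": "HEALTH_WARN", "alert": "CephHealthWarning", "for": "30m", "severity": "ticket"},
    2: {"state": "HEALTH_ERR", "alert": "CephHealthError", "for": "5m", "severity": "page"},
}


def ceph_health_rule(status: int):
    """Return the rule a ceph_health_status sample matches, or None when healthy."""
    return CEPH_HEALTH_RULES.get(status)


if __name__ == "__main__":
    for status in (0, 1, 2):
        print(status, ceph_health_rule(status))
```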

CephMonOutOfQuorum

Meaning#

At least one Ceph monitor (mon) has dropped out of quorum. Mons maintain the cluster map and consensus; losing one reduces fault tolerance, and losing a majority halts the cluster.

Fires when:

count(ceph_mon_quorum_status == 0) > 0

for: 10m, severity page, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.

Impact#

Mons are the control plane of Ceph. With one mon out of quorum the cluster still serves IO but has no redundancy margin; if quorum is lost entirely, all Ceph IO stops and PVCs across all environments on hetzner become unavailable.
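The quorum arithmetic behind this can be sketched as follows. It assumes the usual strict-majority rule for monitors; the mon counts in the example are illustrative, not the actual cluster size.

```python
def mon_quorum(total_mons: int, mons_in_quorum: int) -> dict:
    """Report whether a strict majority of mons is present and how many more can be lost."""
    if total_mons < 1 or not 0 <= mons_in_quorum <= total_mons:
        raise ValueError("invalid mon counts")
    needed = total_mons // 2 + 1  # strict majority
    return {
        "needed": needed,
        "has_quorum": mons_in_quorum >= needed,
        "further_losses_tolerated": max(mons_in_quorum - needed, 0),
    }


if __name__ == "__main__":
    print(mon_quorum(3, 3))  # healthy: one loss tolerated
    print(mon_quorum(3, 2))  # alert state: still serving IO, nothing to spare
    print(mon_quorum(3, 1))  # quorum lost: Ceph IO stops
```

With three mons, the state this alert reports (one mon out) is exactly the point where the next mon failure stops the cluster.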

CephOSDDown

Meaning#

One or more Ceph OSDs (object storage daemons — the per-disk processes that store data) are marked down. Each OSD maps to a physical disk on a hetzner node; a down OSD reduces redundancy and capacity.

Fires when:

count(ceph_osd_up == 0) > 0

for: 10m, severity ticket, tier platform. This is a cluster-wide platform alert and carries no environment label — only cluster.

Impact#

Ceph keeps serving IO from surviving replicas, so usually no outage. But redundancy is reduced and recovery/backfill load increases. Multiple OSDs down (or a full failure domain) can cause degraded/unavailable PGs and put PVCs across all environments at risk.

ElasticsearchClusterRed

Meaning#

The Elasticsearch cluster health is RED: at least one primary shard is unassigned, so part of the index data is unavailable and writes to affected indices fail. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

elasticsearch_cluster_health_status{color="red"} == 1

for: 5m, severity page, tier platform.

Impact#

Search and indexing for the affected indices is down. This includes logging/observability data streams and application search indices (e.g. sw_user, sw_company, sw_company_member). Any service that reads or writes those indices will see errors or empty results until the primaries are reassigned.

ElasticsearchClusterYellow

Meaning#

The Elasticsearch cluster health is YELLOW: all primary shards are assigned, but one or more replica shards are unassigned. Data is fully available, but redundancy is reduced. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

elasticsearch_cluster_health_status{color="yellow"} == 1

for: 30m, severity ticket, tier platform.

Impact#

No outage. Reads and writes continue to work. The risk is reduced fault tolerance: if a node holding a primary now fails, the cluster could go RED because there is no replica to promote. Performance for read-heavy indices may also drop while replicas are missing.
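The difference between YELLOW and RED comes down to which kind of shard is unassigned. A minimal sketch of that rule, using hypothetical function and parameter names:

```python
def es_cluster_color(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Derive cluster health colour from unassigned shard counts."""
    if unassigned_primaries > 0:
        return "red"     # some data is unavailable
    if unassigned_replicas > 0:
        return "yellow"  # data available, redundancy reduced
    return "green"


if __name__ == "__main__":
    print(es_cluster_color(0, 0))
    print(es_cluster_color(0, 4))
    print(es_cluster_color(1, 4))
```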

ElasticsearchDiskWatermark

Meaning#

Free disk space on at least one Elasticsearch data node has dropped below 15%, approaching the flood-stage watermark (default 95% used). At flood stage Elasticsearch makes indices on the affected node read-only to protect the disk. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

min(elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes) < 0.15

for: 15m, severity ticket, tier platform.

Impact#

As watermarks are crossed, ES first stops allocating new shards to the node (low watermark, default 85% used, which can cause YELLOW), then tries to relocate shards away from it (high watermark, default 90%), and at flood stage (default 95%) applies the index.blocks.read_only_allow_delete block: writes to affected indices fail while reads continue. Logging/observability ingestion and application indices stop accepting new data until disk is freed and the block is cleared.
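To see where the 15% free-space alert sits relative to the watermarks, the sketch below classifies a node's disk usage. The 85% / 90% / 95% values are Elasticsearch's documented defaults and are assumed here; the cluster settings may differ.

```python
# Elasticsearch default disk watermarks, as fractions of disk used (assumed defaults).
LOW_WATERMARK = 0.85
HIGH_WATERMARK = 0.90
FLOOD_STAGE = 0.95
ALERT_FREE_FRACTION = 0.15  # threshold used by ElasticsearchDiskWatermark


def es_disk_stage(available_bytes: float, size_bytes: float) -> str:
    """Return the highest watermark the node has crossed."""
    used = 1 - available_bytes / size_bytes
    if used >= FLOOD_STAGE:
        return "flood_stage"  # indices with shards on the node become read-only
    if used >= HIGH_WATERMARK:
        return "high"         # shards are relocated away
    if used >= LOW_WATERMARK:
        return "low"          # no new shards allocated to the node
    return "ok"


def disk_alert_fires(available_bytes: float, size_bytes: float) -> bool:
    """Mirror the alert expression: free fraction below 0.15."""
    return available_bytes / size_bytes < ALERT_FREE_FRACTION


if __name__ == "__main__":
    for free in (20, 12, 8, 3):
        print(free, es_disk_stage(free, 100), disk_alert_fires(free, 100))
```

Under the default settings the alert therefore starts firing at roughly the low watermark, leaving about ten percentage points before flood stage.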

ElasticsearchHeapHigh

Meaning#

JVM heap usage on an Elasticsearch node has been above 90% of max heap for a sustained period. The name label identifies the node. Persistent heap pressure causes frequent/long GC pauses, slow responses, and can destabilize or OOM the node. This is a cluster-wide platform alert and carries no environment label, only cluster.

Fires when:

elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.90

for: 15m, severity ticket, tier platform.

Impact#

Degraded performance across the cluster: increased query/index latency, GC stop-the-world pauses, and risk of the affected node dropping out (which can cascade to YELLOW/RED). No immediate data loss while the alert is just heap pressure.

EnvironmentHigh5xxRate

Meaning#

The 5xx error ratio for a single environment exceeded its configured threshold. This is an environment-tier SLO alert emitted per environment by the safetywing-environment chart, scoped to that environment’s application services.

Fires when: the environment’s 5xx ratio crosses the threshold while traffic is above a minimum RPS floor (the minRps guard avoids alerting on noise at low traffic).

# hetzner / Traefik form
sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*",code=~"5.."}[5m]))
 / sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*"}[5m])) > <ratio>
and
sum(rate(traefik_service_requests_total{service=~"safetywing-<env>-applications-.*"}[5m])) > <minRps>
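How the ratio and the minRps floor combine in the expression above can be sketched in a few lines. The function and the sample numbers are hypothetical; the real thresholds come from the chart values.

```python
def env_5xx_alert(rate_5xx: float, rate_total: float, ratio: float, min_rps: float) -> bool:
    """True when the 5xx ratio exceeds `ratio` AND traffic is above the `min_rps` floor."""
    if rate_total <= min_rps:
        return False  # too little traffic for the ratio to be meaningful
    return rate_5xx / rate_total > ratio


if __name__ == "__main__":
    # One failure out of two requests per second: a 50% ratio, but below the floor.
    print(env_5xx_alert(1, 2, ratio=0.05, min_rps=5))
    # 10% errors at 100 rps: above both thresholds.
    print(env_5xx_alert(10, 100, ratio=0.05, min_rps=5))
```

Without the floor, a single failed request in a quiet environment would be enough to trip the ratio.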

On GKE environments the ingress is nginx, so the source metric differs:

KafkaConnectFailedTasks

Meaning#

One or more Debezium CDC connector tasks have entered the FAILED state, so change capture for the affected connector is degraded or fully stopped.

Fires when:

max(kafka_connect_worker_metrics_connector_failed_task_count{namespace="safetywing-<env>-infra"}) > 0

for: 10m, severity page, tier component.

Impact#

CDC from MOCO MySQL into Kafka is stalled for the failed connector. Downstream consumers stop receiving database changes: search indices fall behind, derived/mirror tables go stale, and any event-driven flow fed by these topics no longer reflects new writes. Lag grows until the task is recovered.

KafkaConnectNoConnectors

Meaning#

The Kafka Connect cluster is running but reports zero connectors, meaning no Debezium CDC source connector is deployed or running. CDC for the environment may be unconfigured, or all connectors may have been removed.

Fires when:

max(kafka_connect_worker_metrics_connector_count{namespace="safetywing-<env>-infra"}) == 0

for: 15m, severity ticket, tier component.

Impact#

No change capture is happening at all in this environment: no MySQL changes flow from MOCO MySQL into Kafka. Downstream consumers (search indices, mirror/derived tables, event-driven flows) receive nothing new. For a freshly provisioned env this may be expected during bring-up; for an established env it means CDC is silently broken.

KafkaConnectWorkersDown

Meaning#

Fewer Kafka Connect workers are reporting metrics than the number of replicas the kafka-cdc chart expects, indicating one or more Connect pods are down, crash-looping, or not being scraped.

Fires when:

count(kafka_connect_worker_metrics_connector_count{namespace="safetywing-<env>-infra"}) < <connect.replicas>

for: 5m, severity page, tier component.

Impact#

Reduced Connect capacity and resilience for the CDC pipeline. Tasks owned by the missing worker are rebalanced onto survivors (added load, possible throughput drop and lag); if the cluster is at one replica, CDC from MOCO MySQL into Kafka is fully stopped and downstream consumers (search indices, mirror tables, event flows) stop receiving DB changes.

KafkaConsumerGroupLagHigh

Meaning#

A consumer group is falling behind the producers on a topic — the gap between the latest offset and the group’s committed offset (lag) has exceeded the threshold. Messages are being produced faster than they are consumed, so processing is delayed. The consumergroup and topic labels identify exactly which consumer and topic are affected.

Fires when: per-(consumergroup, topic) lag exceeds <threshold> for 15m. Severity ticket, tier component.

max by (consumergroup, topic) (kafka_consumergroup_lag{namespace="safetywing-<env>-infra"}) > <threshold>
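The lag definition can be made concrete with a small sketch. It assumes kafka_consumergroup_lag is exported per partition (as kafka_exporter does), so the max by (consumergroup, topic) picks the worst partition; the row layout mirrors the CURRENT-OFFSET and LOG-END-OFFSET columns of kafka-consumer-groups.sh, and the numbers are made up.

```python
def partition_lags(rows):
    """rows: iterable of (partition, current_offset, log_end_offset) tuples."""
    return {partition: max(end - current, 0) for partition, current, end in rows}


def alert_value(rows) -> int:
    """Worst per-partition lag, i.e. what the max-by aggregation compares to the threshold."""
    return max(partition_lags(rows).values(), default=0)


if __name__ == "__main__":
    rows = [(0, 1500, 1500), (1, 900, 1400), (2, 1000, 1200)]
    print(partition_lags(rows))
    print(alert_value(rows))
```

A group can therefore alert because of a single stuck partition even when the others are fully caught up.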

Impact#

Delayed processing for the affected consumer group → stale downstream data, late side effects, growing end-to-end latency.
If lag keeps climbing, retention may expire un-consumed messages, causing permanent message loss.
Brokers retain more unconsumed data, increasing disk usage.

Diagnosis#

kubectl config use-context hetzner
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra

# Describe the lagging group: per-partition LAG, CURRENT-OFFSET, LOG-END-OFFSET, CONSUMER-ID
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
 --describe --group <consumergroup>

# Are there active members, or is the group empty / rebalancing?
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
 bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
 --describe --group <consumergroup> --members --verbose

# Inspect the consuming application workload
kubectl get pods -A | grep <consumer-app>
kubectl logs -n <consumer-ns> <consumer-pod> --tail=200

Confirm trend and scope in Prometheus (prom-ep.hetzner.safetywing.dev):

KafkaNoActiveController

Meaning#

In KRaft mode exactly one controller node should be active (the metadata quorum leader). This alert means the cluster sees zero active controllers (no quorum leader) or more than one (split brain). Either state puts cluster metadata — topic, partition, ISR, and config state — at risk and blocks administrative operations.

Fires when: the summed active controller count across the namespace is not exactly 1 for 5m. Severity page, tier component.
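The "not exactly 1" condition covers two different failure modes. A minimal sketch, with hypothetical names, of how the summed count maps to each:

```python
def controller_state(active_counts) -> str:
    """active_counts: one active-controller value (0 or 1) per controller node."""
    total = sum(active_counts)
    if total == 0:
        return "no_active_controller"  # no quorum leader
    if total == 1:
        return "ok"
    return "split_brain"               # more than one node claims leadership


def alert_fires(active_counts) -> bool:
    """The alert condition: the summed count is anything other than exactly 1."""
    return sum(active_counts) != 1


if __name__ == "__main__":
    for counts in ([0, 1, 0], [0, 0, 0], [1, 1, 0]):
        print(counts, controller_state(counts), alert_fires(counts))
```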

KafkaOfflinePartitions

Meaning#

One or more partitions have no leader, so they cannot serve reads or writes. Any producer or consumer touching an offline partition is blocked, which usually means data loss risk and stalled traffic across affected topics.

Fires when: any broker reports a non-zero offline partition count for 5m. Severity page, tier component.

max(kafka_controller_kafkacontroller_offlinepartitionscount{namespace="safetywing-<env>-infra"}) > 0

Impact#

Produce and consume requests to offline partitions fail or hang.
Consumer groups stall on the affected partitions; lag grows.
Topics with offline partitions are effectively partially unavailable.
Often a symptom of multiple broker failures or unavailable replicas.

Diagnosis#

kubectl config use-context hetzner

# Strimzi CRs and broker pods
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/cluster

# Which brokers are not Ready
kubectl get pods -n safetywing-<env>-infra -o wide | grep -v Running

# Broker / controller logs (look for leader election, ISR, disk errors)
kubectl logs -n safetywing-<env>-infra <broker-pod> --tail=200

# Cluster + topic state from inside a broker
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
 bin/kafka-topics.sh --bootstrap-server localhost:9092 \
 --describe --under-min-isr-partitions
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
 bin/kafka-topics.sh --bootstrap-server localhost:9092 \
 --describe --unavailable-partitions

Confirm scope in Prometheus (prom-ep.hetzner.safetywing.dev):

KafkaUnderReplicatedPartitions

Meaning#

Partitions have fewer in-sync replicas than their configured replication factor. The cluster is still serving traffic, but durability is reduced — losing one more broker could take partitions offline or lose data. Usually a broker is down, restarting, or lagging behind on replication.

Fires when: any broker reports a non-zero under-replicated partition count for 10m. Severity ticket, tier component.

max(kafka_server_replicamanager_underreplicatedpartitions{namespace="safetywing-<env>-infra"}) > 0

Impact#

Reduced fault tolerance: a single additional broker failure may cause offline partitions or data loss.
Producers using acks=all are rejected with NotEnoughReplicas errors (and stall while retrying) if the ISR drops below min.insync.replicas.
Sustained under-replication often precedes an OfflinePartitions page.
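The relationship between replication factor, ISR size, and min.insync.replicas can be sketched per partition. The function is hypothetical and the replication factor 3 / min.insync.replicas 2 values in the example are a common setup assumed for illustration, not the cluster's confirmed configuration.

```python
def partition_durability(replication_factor: int, in_sync_replicas: int,
                         min_insync_replicas: int) -> dict:
    """Summarise one partition's replication state."""
    return {
        # What this alert counts: fewer in-sync replicas than configured.
        "under_replicated": in_sync_replicas < replication_factor,
        # Whether acks=all producers are still accepted.
        "accepts_acks_all": in_sync_replicas >= min_insync_replicas,
        # How many more in-sync replicas can be lost before acks=all writes fail.
        "replicas_to_spare": max(in_sync_replicas - min_insync_replicas, 0),
    }


if __name__ == "__main__":
    print(partition_durability(3, 3, 2))  # healthy
    print(partition_durability(3, 2, 2))  # under-replicated, still writable
    print(partition_durability(3, 1, 2))  # acks=all writes rejected
```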

Diagnosis#

kubectl config use-context hetzner

# Strimzi CRs and broker pods
kubectl get kafka,kafkanodepool -n safetywing-<env>-infra
kubectl get pods -n safetywing-<env>-infra -l strimzi.io/cluster -o wide

# Any broker not Ready / restarting?
kubectl get pods -n safetywing-<env>-infra | grep -vE "Running|Completed"

# Which partitions are under-replicated / below min ISR
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
 bin/kafka-topics.sh --bootstrap-server localhost:9092 \
 --describe --under-replicated-partitions
kubectl exec -n safetywing-<env>-infra <broker-pod> -- \
 bin/kafka-topics.sh --bootstrap-server localhost:9092 \
 --describe --under-min-isr-partitions

# Broker logs: replica fetcher, ISR shrink/expand, disk
kubectl logs -n safetywing-<env>-infra <broker-pod> --tail=200 | grep -iE "ISR|replica|fetch"

Confirm scope in Prometheus (prom-ep.hetzner.safetywing.dev):

MysqlConnectionsSaturated

Meaning#

The number of connected threads on a MySQL instance is approaching max_connections. New connections risk being refused with ER_CON_COUNT_ERROR.

Fires when:

max by (pod) (
 mysql_global_status_threads_connected{namespace="safetywing-<env>-infra"}
 / mysql_global_variables_max_connections{namespace="safetywing-<env>-infra"}
) > <ratio>

for: 10m, severity ticket, tier component.

Impact#

Once max_connections is hit, new clients get “Too many connections” and application requests fail.
Often a symptom of leaked/unclosed connections, an oversized client pool, or slow queries holding connections open.

Diagnosis#

kubectl config use-context hetzner
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl moco status -n safetywing-<env>-infra <cluster>

# Open a mysql shell to the primary and inspect live connections
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
 -e "SHOW PROCESSLIST;"
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
 -e "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"

# Group connections by host/user to find the offender
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
 -e "SELECT user, host, count(*) FROM information_schema.processlist GROUP BY user, host ORDER BY 3 DESC;"

# Current usage ratio per pod
max by (pod) (
 mysql_global_status_threads_connected{namespace="safetywing-<env>-infra"}
 / mysql_global_variables_max_connections{namespace="safetywing-<env>-infra"}
)

Mitigation#

Identify the offending client(s) from SHOW PROCESSLIST / the grouped query above — usually one service with a misconfigured pool or leaked connections.
Fix at the source: scale down the offending workload, tune its connection pool max size, or restart it to drop leaked connections.
Kill stuck/sleeping connections if needed:
```
KILL <process_id>;
```
If demand is legitimate, raise max_connections in the MySQLCluster spec (MOCO reconciles it into the instances’ my.cnf):
```
spec:
 mysqlConfigMapName: <name>  # or set under spec.podTemplate config
# in the referenced ConfigMap:
# max_connections = <n>
```
Ensure the instance has memory headroom — each connection consumes per-thread buffers.

MysqlDiskFillingUp

Meaning#

A mysql-data-* PersistentVolumeClaim is running low on free space. If it fills completely, mysqld will fail writes and may refuse to start.

Fires when:

min by (persistentvolumeclaim) (
 kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"mysql-data-.*"}
 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"mysql-data-.*"}
) < (1 - <ratio>)

for: 15m, severity ticket, tier component.
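Because the expression compares the free fraction against (1 - <ratio>), the configured ratio is effectively a "fraction used" threshold. A worked sketch, assuming a hypothetical ratio of 0.85:

```python
def mysql_disk_alert(available_bytes: float, capacity_bytes: float,
                     used_ratio_threshold: float) -> bool:
    """Mirror the rule: free fraction < (1 - used_ratio_threshold)."""
    free_fraction = available_bytes / capacity_bytes
    return free_fraction < (1 - used_ratio_threshold)


if __name__ == "__main__":
    gib = 1024 ** 3
    # 10 GiB free of 100 GiB is 90% used, above an 85% threshold.
    print(mysql_disk_alert(10 * gib, 100 * gib, 0.85))
    # 20 GiB free of 100 GiB is 80% used, below the threshold.
    print(mysql_disk_alert(20 * gib, 100 * gib, 0.85))
```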

Impact#

A full data volume causes write errors and can crash the instance (MysqlInstanceDown).
Common culprits: accumulated binary logs, an oversized dataset, or relay logs/temp files on a lagging replica.

Diagnosis#

kubectl config use-context hetzner
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl get pvc -n safetywing-<env>-infra

# Inspect on-disk usage from inside the mysqld container
kubectl exec -n safetywing-<env>-infra <pod> -c mysqld -- df -h /var/lib/mysql
kubectl exec -n safetywing-<env>-infra <pod> -c mysqld -- \
 sh -c 'du -sh /var/lib/mysql/* | sort -h | tail -20'

# Binary log inventory
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
 -e "SHOW BINARY LOGS;"
# Largest tables
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
 -e "SELECT table_schema, table_name, ROUND((data_length+index_length)/1024/1024) AS mb FROM information_schema.tables ORDER BY mb DESC LIMIT 20;"

# Fraction free per PVC
min by (persistentvolumeclaim) (
 kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"mysql-data-.*"}
 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"mysql-data-.*"}
)

Mitigation#

Expand the PVC (preferred — the StorageClass supports volume expansion). Increase the volume request in the MySQLCluster volumeClaimTemplates; MOCO/Kubernetes resizes the PVC online:
```
spec:
  volumeClaimTemplates:
    - metadata:
        name: mysql-data
      spec:
        resources:
          requests:
            storage: <larger-size>
```
Apply via GitOps, then confirm with kubectl get pvc -n safetywing-<env>-infra.
Prune binary logs if they dominate usage (only purge logs already applied by all replicas):
```
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;
```
For a durable fix, tune binlog_expire_logs_seconds in the cluster MySQL config.
Replica behind: a lagging replica accumulates relay logs — clearing the lag (MysqlReplicationLagHigh) lets them be purged.
Reclaim table space: drop unused data or run OPTIMIZE TABLE on bloated tables (note: requires temporary extra space, so resize first if very full).

MysqlInstanceDown

Meaning#

A MySQL instance’s mysqld_exporter sidecar reports the server unreachable, so the mysqld process is down or not accepting connections.

Fires when:

max by (pod) (mysql_up{namespace="safetywing-<env>-infra"}) == 0

for: 5m, severity page, tier component.

Impact#

The affected instance serves no queries.
If the primary is down, MOCO must fail over before writes can resume; expect a short write outage.
If a replica is down, read capacity and replication redundancy are reduced.

Diagnosis#

kubectl config use-context hetzner

# Cluster + member roles (which pod is primary vs replica)
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl moco status -n safetywing-<env>-infra <cluster>

# Pod state and recent events (OOMKill, evictions, probe failures)
kubectl get pods -n safetywing-<env>-infra -l app.kubernetes.io/name=mysql -o wide
kubectl describe pod -n safetywing-<env>-infra <pod>

# Container logs — mysqld, the MOCO agent, and the exporter
kubectl logs -n safetywing-<env>-infra <pod> -c mysqld --tail=200
kubectl logs -n safetywing-<env>-infra <pod> -c agent --tail=200
kubectl logs -n safetywing-<env>-infra <pod> -c mysqld-exporter --tail=100

# Confirm which pod(s) are down
max by (pod) (mysql_up{namespace="safetywing-<env>-infra"})

Mitigation#

Check pod events for the root cause: OOMKilled (raise memory limits in the MySQLCluster .spec.podTemplate), node pressure/eviction, or failed PVC mount.
If disk is full, mysqld will refuse to start — see MysqlDiskFillingUp and expand the PVC first.
If the process crashed but the pod is healthy, restart it:
```
kubectl delete pod -n safetywing-<env>-infra <pod>
```
MOCO recreates the pod; verify it rejoins via kubectl moco status.
If the primary is down and not recovering, let MOCO fail over to a healthy replica; confirm a new primary was elected in kubectl moco status. Investigate the old primary before reintroducing it.

If MOCO cannot reconcile, inspect the operator:

kubectl logs -n moco-system deploy/moco-controller-manager --tail=200

MysqlReplicationLagHigh

Meaning#

A MySQL replica is applying the primary’s binlog stream slower than it is produced, so its Seconds_Behind_Master has exceeded the threshold.

Fires when:

max by (pod) (
 mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"}
) > <seconds>

for: 10m, severity ticket, tier component.

Impact#

Reads served by the lagging replica return stale data.
A failover to a lagging replica could lose recent writes (MOCO uses semi-sync, which bounds but does not eliminate this risk under degraded conditions).

Diagnosis#

kubectl config use-context hetzner
kubectl get mysqlcluster -n safetywing-<env>-infra
kubectl moco status -n safetywing-<env>-infra <cluster>

# Replication status on the lagging replica
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin --index <n> <cluster> -- \
 -e "SHOW REPLICA STATUS\G"

# Container logs of the replica (mysqld + agent)
kubectl logs -n safetywing-<env>-infra <replica-pod> -c mysqld --tail=200
kubectl logs -n safetywing-<env>-infra <replica-pod> -c agent --tail=200

# Long-running transactions on the primary that bloat the binlog
kubectl moco mysql -n safetywing-<env>-infra -u moco-admin <cluster> -- \
 -e "SELECT * FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10;"

# Lag per pod
max by (pod) (mysql_slave_status_seconds_behind_master{namespace="safetywing-<env>-infra"})

In SHOW REPLICA STATUS, check Replica_IO_Running / Replica_SQL_Running (both should be Yes), Last_Error, and Seconds_Behind_Source (named Seconds_Behind_Master in the older SHOW SLAVE STATUS output).

NodeFilesystemAlmostFull

Meaning#

A node filesystem is running low on free space. This is a supplemental platform-tier rule on top of the kube-prometheus-stack node-exporter mixin.

Fires when: available space on a non-ephemeral filesystem drops below the configured ratio.

(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
 / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}) < <ratio>

for: 15m, severity ticket, tier platform (cluster-wide, no environment label). The offending filesystem is identified by the instance and mountpoint labels.

Impact#

A full filesystem on a node can wedge the kubelet, fail image pulls, block container log writes, evict pods (ephemeral-storage pressure), and on Talos can disrupt the system partition. If the node hosts stateful workloads (Ceph OSDs, MySQL/MOCO, RabbitMQ), data writes can stall or fail.

RabbitmqDeadLetterMessages

Meaning#

A dead-letter queue (DLQ) holds one or more ready messages. DLQs are the topology chart’s {namespace}.deadletter queues — under normal operation they are empty.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra", queue=~".+[.]deadletter"}) > 0

for: 15m, severity ticket, tier component. The queue label identifies the affected DLQ (always ends in .deadletter).

Each rabbitmq-topology namespace wires a three-queue retry flow: the main queue {ns} dead-letters failed messages to {ns}.retry (which holds them for retryDelay, default 10 min, then re-publishes to the main queue). A message only lands in the dead-letter queue {ns}.deadletter when it is dead-lettered with the deadletter routing key — i.e. it has exhausted its retries or was explicitly rejected as terminally unprocessable. So a non-empty DLQ means “messages a consumer gave up on”, not transient backpressure.
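The three-queue flow can be traced with a small simulation. This is a sketch under stated assumptions: the namespace "billing" and the max_retries limit are hypothetical, and the source does not say how or where retries are counted, so the counter here stands in for whatever the consumer actually uses.

```python
def next_queue(ns: str, failures: int, max_retries: int, terminal: bool = False) -> str:
    """Queue a failed message is dead-lettered to."""
    if terminal or failures >= max_retries:
        return f"{ns}.deadletter"  # consumer gave up on the message
    return f"{ns}.retry"           # held for retryDelay, then re-published


def simulate(ns: str, outcomes, max_retries: int = 3) -> list:
    """outcomes: 'ok', 'fail' or 'reject' per delivery attempt. Returns the path taken."""
    path = [ns]
    failures = 0
    for outcome in outcomes:
        if outcome == "ok":
            return path + ["acked"]
        failures += 1
        queue = next_queue(ns, failures, max_retries, terminal=(outcome == "reject"))
        path.append(queue)
        if queue.endswith(".deadletter"):
            return path
        path.append(ns)  # retry delay elapsed: message is back on the main queue
    return path


if __name__ == "__main__":
    print(simulate("billing", ["fail", "ok"]))
    print(simulate("billing", ["fail", "fail", "fail"]))
    print(simulate("billing", ["reject"]))
```

Only the second and third traces end in billing.deadletter, which matches the reading that a non-empty DLQ holds messages a consumer gave up on.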

RabbitmqDiskAlarm

Meaning#

A RabbitMQ node has dropped below its free-disk-space watermark and raised a disk alarm. RabbitMQ blocks publishers to avoid filling the disk and corrupting the message store.

Fires when:

max(rabbitmq_alarms_free_disk_space_watermark{namespace="safetywing-<env>-infra"}) == 1

for: 5m, severity page, tier component.

Impact#

Publishing is blocked cluster-wide. As with the memory alarm, once any node trips the free-disk watermark RabbitMQ blocks all publishing connections until free space recovers. Consumers continue, but producers hang and dependent backend services back up. If the disk fills completely the node can crash and lose durability guarantees, so this must be cleared promptly.

RabbitmqMemoryAlarm

Meaning#

A RabbitMQ node has crossed its memory high-watermark and raised a memory alarm. RabbitMQ responds by blocking all publishers across the cluster to protect the broker from running out of memory.

Fires when:

max(rabbitmq_alarms_memory_used_watermark{namespace="safetywing-<env>-infra"}) == 1

for: 5m, severity page, tier component.

Impact#

Publishing is blocked cluster-wide. Once any node hits the memory watermark, RabbitMQ throttles/blocks all connections that are publishing, so producers across every queue hang. Consumers keep draining, but new messages cannot be accepted. This typically surfaces as backend services timing out on publish and growing request latency until memory is reclaimed.

RabbitmqNodeDown

Meaning#

Fewer RabbitMQ nodes are reporting metrics than the configured number of replicas, meaning one or more cluster members are down or unreachable.

Fires when:

count(rabbitmq_build_info{namespace="safetywing-<env>-infra"}) < <replicas>

for: 5m, severity page, tier component.

Impact#

A missing node reduces capacity and redundancy. With quorum queues, losing a node erodes the quorum margin; losing a majority makes those queues unavailable for reads and writes. Classic queues without mirrors that live on the down node become unavailable until it returns, while mirrored classic queues fail over to a surviving mirror. Sustained node loss risks a full cluster outage.

RabbitmqQueueBacklog

Meaning#

A queue has accumulated a large number of ready (undelivered) messages, meaning consumers are not keeping up with producers.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}) > <threshold>

for: 15m, severity ticket, tier component. The queue label identifies the affected queue.

Impact#

Messages are being produced faster than they are consumed. Downstream processing is delayed, so whatever the queue feeds (notifications, payments, sync jobs, etc.) lags behind. A persistently growing backlog also consumes memory and disk and can eventually trip the memory or disk alarms and block publishing cluster-wide.

RabbitmqQueueNoConsumers

Meaning#

A queue has ready messages but zero consumers attached, so nothing is draining it. Messages will sit indefinitely until a consumer connects.

Fires when:

max by (queue) (rabbitmq_queue_messages_ready{namespace="safetywing-<env>-infra"}) > 0
and
max by (queue) (rabbitmq_queue_consumers{namespace="safetywing-<env>-infra"}) == 0

for: 15m, severity ticket, tier component. The queue label identifies the affected queue.

Impact#

Work enqueued on this queue is not being processed at all. Unlike a slow backlog, there is no progress whatsoever, so the dependent feature is effectively down. The backlog will keep growing and can eventually trip the memory or disk alarms and block publishing cluster-wide.

TraefikDown

Meaning#

Prometheus has no healthy Traefik scrape target. Either the Traefik edge ingress is down, or just its metrics endpoint/scraping is broken.

Fires when: there is no up == 1 series for any Traefik scrape job.

absent(up{job=~".*traefik.*"} == 1)

for: 10m, severity page, tier platform (cluster-wide, no environment label).
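The absent(... == 1) form fires both when every Traefik target is down and when no Traefik target exists at all. The sketch below emulates that behaviour over a plain dictionary of up samples; the job names are hypothetical.

```python
import re

TRAEFIK_JOB = re.compile(r".*traefik.*")  # PromQL regex matchers are fully anchored


def traefik_down(up_samples: dict) -> bool:
    """up_samples maps job name -> list of `up` values (0 or 1), one per target."""
    healthy = [
        value
        for job, values in up_samples.items()
        if TRAEFIK_JOB.fullmatch(job)
        for value in values
        if value == 1
    ]
    return not healthy  # true when no matching target reports up == 1


if __name__ == "__main__":
    print(traefik_down({"traefik-metrics": [1, 0]}))  # one healthy pod is enough
    print(traefik_down({"traefik-metrics": [0, 0]}))  # all targets down
    print(traefik_down({"node-exporter": [1]}))       # target missing entirely
```

One consequence is that a single healthy Traefik target keeps the alert silent even if other DaemonSet pods are down.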

Impact#

Traefik is the edge ingress (a hostNetwork DaemonSet on the Hetzner cluster). If Traefik itself is down, all external HTTP(S) traffic into the cluster fails — every public domain behind it. If only metrics are broken, traffic may still flow but we are blind to edge health and 5xx alerting is degraded.

TraefikHigh5xxRate

Meaning#

The cluster-wide ratio of 5xx responses served by Traefik exceeds 5%. This is an edge-level signal aggregating all services behind Traefik.

Fires when: 5xx requests are more than 5% of all Traefik service requests.

sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
 / sum(rate(traefik_service_requests_total[5m])) > 0.05

for: 10m, severity ticket, tier platform (cluster-wide, no environment label).

Impact#

A meaningful fraction of requests entering through the edge are failing with server errors. Because this is cluster-wide, it usually points at either a broadly impacting backend (a shared dependency) or one high-traffic service skewing the aggregate. Users across one or more environments see errors.
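To tell a broad failure from one service skewing the aggregate, it helps to compare the cluster-wide ratio with per-service ratios. The sketch below uses made-up service names and request rates.

```python
def five_xx_ratios(per_service: dict):
    """per_service maps service -> (5xx rate, total rate). Returns (aggregate, by_service)."""
    total_5xx = sum(errors for errors, _ in per_service.values())
    total = sum(requests for _, requests in per_service.values())
    aggregate = total_5xx / total if total else 0.0
    by_service = {
        name: (errors / requests if requests else 0.0)
        for name, (errors, requests) in per_service.items()
    }
    return aggregate, by_service


if __name__ == "__main__":
    rates = {"api": (60.0, 1000.0), "admin": (0.0, 50.0), "web": (1.0, 150.0)}
    aggregate, by_service = five_xx_ratios(rates)
    print(round(aggregate, 4), aggregate > 0.05)
    print(max(by_service, key=by_service.get))
```

In this example the aggregate crosses 5% although only one service is unhealthy, so the per-service breakdown is the first thing to check.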