Kafka monitoring and performance tuning: which metrics matter most, and how do you optimize throughput and latency?

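Key metrics to monitor: UnderReplicatedPartitions (any value above 0 means some follower replicas are not keeping up with their leaders), OfflinePartitionsCount (alert immediately when above 0, since those partitions have no active leader and cannot serve reads or writes), BytesInPerSec/BytesOutPerSec (broker throughput), RequestHandlerAvgIdlePercent (a fraction from 0 to 1; sustained values below 0.2 indicate an overloaded broker), and consumer lag (records-lag-max on the consumer) to detect slow consumers.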
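The thresholds above can be sketched as a small rule table. This is only an illustration: the `Rule`, `build_rules`, and `evaluate` names are hypothetical, the sample values are made up, and the lag limit (`max_lag`) is an assumed, workload-specific number that the answer does not specify. In practice these checks would more likely live in Prometheus alerting rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str                         # key in the sampled-metrics dict
    breached: Callable[[float], bool]   # True when the value is unhealthy
    message: str                        # text appended to the alert


def build_rules(max_lag: float = 10_000) -> List[Rule]:
    """Thresholds from the answer; max_lag is an assumed, workload-specific limit."""
    return [
        Rule("UnderReplicatedPartitions", lambda v: v > 0, "replication is falling behind"),
        Rule("OfflinePartitionsCount", lambda v: v > 0, "partitions have no active leader"),
        Rule("RequestHandlerAvgIdlePercent", lambda v: v < 0.2, "broker request handlers are saturated"),
        Rule("records-lag-max", lambda v: v > max_lag, "consumer is falling behind"),
    ]


def evaluate(sample: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return one alert string per breached rule; metrics missing from the sample are skipped."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and rule.breached(value):
            alerts.append(f"{rule.metric}={value}: {rule.message}")
    return alerts


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    sample = {
        "UnderReplicatedPartitions": 3,
        "OfflinePartitionsCount": 0,
        "RequestHandlerAvgIdlePercent": 0.15,
        "records-lag-max": 500,
    }
    for alert in evaluate(sample, build_rules()):
        print(alert)
```

Keeping the thresholds in one table makes it easy to review them against the broker's normal baseline before wiring them into real alerts.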

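Producer throughput tuning: increase batch.size (16 KB → 128 KB), raise linger.ms (0 → 20 ms) so batches have time to fill, and set compression.type=lz4 to reduce network I/O. The trade-off is up to linger.ms of added latency per send.
Consumer throughput tuning: increase fetch.min.bytes and fetch.max.wait.ms so each fetch returns a larger batch, and increase max.poll.records so each poll() returns more records. Larger values add latency, and the records from one poll() must still be processed within max.poll.interval.ms.
Broker tuning: increase the number of I/O threads (num.io.threads), put Kafka logs on dedicated disks (not shared with the OS), and spread log.dirs across multiple disks for parallel I/O.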
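Observability stack: Kafka Exporter + Prometheus + Grafana. Kafka Exporter covers consumer group lag and topic offsets; broker JMX metrics such as RequestHandlerAvgIdlePercent are typically exposed to Prometheus through the JMX Exporter.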
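As a rough sketch, the producer and consumer settings above can be written as plain config maps. The producer values come from the answer. The consumer values are assumptions chosen only to illustrate "larger than default" (the client defaults are fetch.min.bytes=1, fetch.max.wait.ms=500, max.poll.records=500). The helper names `to_properties` and `added_wait_budget_ms` are hypothetical.

```python
# Producer settings from the answer (defaults there: batch.size 16 KB, linger.ms 0).
PRODUCER_THROUGHPUT = {
    "batch.size": 128 * 1024,    # bytes per partition batch
    "linger.ms": 20,             # wait up to 20 ms for a batch to fill
    "compression.type": "lz4",   # compress batches to cut network I/O
}

# Consumer settings: the values below are illustrative assumptions, not recommendations.
CONSUMER_THROUGHPUT = {
    "fetch.min.bytes": 1024 * 1024,  # broker waits until about 1 MB is available...
    "fetch.max.wait.ms": 1000,       # ...or until this many ms have passed
    "max.poll.records": 2000,        # records handed back per poll()
}


def to_properties(config: dict) -> str:
    """Render a config map in .properties syntax, sorted by key for stable output."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


def added_wait_budget_ms(producer: dict, consumer: dict) -> int:
    """Rough upper bound on the extra waiting these two settings alone can add."""
    return producer.get("linger.ms", 0) + consumer.get("fetch.max.wait.ms", 0)


if __name__ == "__main__":
    print(to_properties(PRODUCER_THROUGHPUT))
    print(to_properties(CONSUMER_THROUGHPUT))
    print("added wait budget (ms):", added_wait_budget_ms(PRODUCER_THROUGHPUT, CONSUMER_THROUGHPUT))
```

The wait budget makes the throughput/latency trade-off explicit: the batching settings that raise throughput are the same ones that bound how long a record may sit before it is sent or fetched.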
