Key metrics to monitor: UnderReplicatedPartitions (> 0 signals a replication issue), OfflinePartitionsCount (alert immediately if > 0), BytesInPerSec/BytesOutPerSec (throughput), RequestHandlerAvgIdlePercent (< 0.2 means the broker is overloaded), and consumer lag (records-lag-max) to detect slow consumers.
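The thresholds above can be sketched as a small check over one scrape of metrics. This is only an illustration: the `BrokerSnapshot` container, the `evaluate` function, and the `max_lag` default are hypothetical names and values, not part of Kafka or of any exporter, and the lag threshold in particular should be chosen per workload.

```python
from dataclasses import dataclass


@dataclass
class BrokerSnapshot:
    """One scrape of the metrics named above (hypothetical container)."""
    under_replicated_partitions: int
    offline_partitions_count: int
    request_handler_avg_idle: float  # fraction between 0.0 and 1.0
    records_lag_max: float           # worst lag seen by a consumer, in records


def evaluate(snapshot: BrokerSnapshot, max_lag: float = 10_000) -> list[str]:
    """Return alert messages, most urgent first.

    max_lag is an assumed, workload-specific threshold; the text gives
    no concrete number for consumer lag.
    """
    alerts = []
    if snapshot.offline_partitions_count > 0:
        alerts.append("CRITICAL: offline partitions, alert immediately")
    if snapshot.under_replicated_partitions > 0:
        alerts.append("WARNING: under-replicated partitions, check replication")
    if snapshot.request_handler_avg_idle < 0.2:
        alerts.append("WARNING: request handlers under 20% idle, broker overloaded")
    if snapshot.records_lag_max > max_lag:
        alerts.append("WARNING: consumer lag above threshold, slow consumer")
    return alerts


if __name__ == "__main__":
    snapshot = BrokerSnapshot(
        under_replicated_partitions=2,
        offline_partitions_count=0,
        request_handler_avg_idle=0.15,
        records_lag_max=500,
    )
    for message in evaluate(snapshot):
        print(message)
```

In a real deployment these rules would more likely live in Prometheus alerting rules than in application code; the function only makes the decision logic explicit.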
- Producer throughput tuning: increase `batch.size` (16 KB → 128 KB), raise `linger.ms` (0 → 20 ms) so batches have time to fill, and enable `compression.type=lz4` to reduce network I/O.
- Consumer throughput tuning: increase `fetch.min.bytes` and `fetch.max.wait.ms` to fetch in larger batches, and increase `max.poll.records`.
- Broker tuning: increase the number of I/O threads (`num.io.threads`), use a dedicated disk for Kafka logs (avoid sharing it with the OS), and spread `log.dirs` across multiple disks for parallel I/O.
- Use Kafka Exporter + Prometheus + Grafana for the observability stack.
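
As a rough sketch, the tuning settings listed above can be collected per role and rendered as `.properties` text. Only the producer values (128 KB, 20 ms, lz4) come from the notes; the consumer and broker values and the `/data/kafka-*` paths are illustrative assumptions, and `to_properties` is a hypothetical helper, not a Kafka API.

```python
# Producer values are the ones given in the notes.
PRODUCER = {
    "batch.size": 128 * 1024,      # 16 KB -> 128 KB
    "linger.ms": 20,               # 0 -> 20 ms
    "compression.type": "lz4",
}

# Assumed example values: the notes only say "increase".
CONSUMER = {
    "fetch.min.bytes": 1024 * 1024,
    "fetch.max.wait.ms": 1000,
    "max.poll.records": 1000,
}

# Assumed example values; the directory paths are hypothetical mount points,
# one per physical disk.
BROKER = {
    "num.io.threads": 16,
    "log.dirs": "/data/kafka-1,/data/kafka-2",
}


def to_properties(config: dict) -> str:
    """Render a config mapping as key=value lines, in insertion order."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


if __name__ == "__main__":
    for name, config in (("producer", PRODUCER),
                         ("consumer", CONSUMER),
                         ("broker", BROKER)):
        print(f"# {name}")
        print(to_properties(config))
        print()
```

Larger batches and a longer linger trade a little latency for throughput, so values like these are best validated against the workload's latency budget before rollout.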