10.7.2 Monitoring Tools and Visualization (JMX/Exporter/Prometheus)
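Kafka monitoring should cover brokers, the controller, producers, and consumers, as well as the JVM and the operating system. The recommended pipeline uses JMX as the data source, exposes it as Prometheus metrics through an exporter, and uses Grafana for visualization and alerting (alert rules can also be evaluated by Prometheus and routed through Alertmanager, as in the diagram). This gives end-to-end visibility along the pipeline and supports capacity trend analysis.
Schematic (data flow)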
flowchart LR
A[Kafka Broker JVM] -->|JMX| B[JMX Exporter]
B -->|HTTP /metrics| C[Prometheus]
C --> D[Grafana]
C --> E[Alertmanager]
E --> F[Notification/On-call]
Enabling and verifying JMX (example)
1) Enable JMX in the Kafka startup script or in systemd:
# /etc/systemd/system/kafka.service.d/jmx.conf
[Service]
Environment="JMX_PORT=9999"
Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false"
2) Reload and restart:
systemctl daemon-reload
systemctl restart kafka
3) Verify the port:
ss -lntp | grep 9999
# Expected: LISTEN 0 50 *:9999
Installing and configuring the JMX Exporter (Java agent)
# 1) Install
mkdir -p /opt/jmx_exporter
cd /opt/jmx_exporter
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
cat > /opt/jmx_exporter/kafka.yml <<'EOF'
startDelaySeconds: 0
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Aggregate broker byte counters (all topics)
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>Count'
    name: kafka_broker_topic_bytes_total
    type: COUNTER
    labels:
      direction: "$1"
  # Number of under-replicated partitions on this broker
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_under_replicated_partitions
    type: GAUGE
  # Request counts for Produce and consumer Fetch requests
  - pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(Produce|FetchConsumer)><>Count'
    name: kafka_request_total
    type: COUNTER
    labels:
      request: "$1"
  # Heap usage; exports jvm_heap_used_bytes and jvm_heap_max_bytes
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(used|max)'
    name: jvm_heap_$1_bytes
    type: GAUGE
EOF
# 2) Configure the Kafka startup options (example: /etc/kafka/kafka.env)
cat > /etc/kafka/kafka.env <<'EOF'
KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/jmx_exporter/kafka.yml"
EOF
# 3) Reference the file from systemd
cat > /etc/systemd/system/kafka.service.d/opts.conf <<'EOF'
[Service]
EnvironmentFile=/etc/kafka/kafka.env
EOF
systemctl daemon-reload
systemctl restart kafka
# 4) Verify the exporter
curl -s http://localhost:7071/metrics | head
# Expected: kafka_* and jvm_* metrics in the output
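Command notes: -javaagent:<jar>=<port>:<config> loads the exporter as a Java agent inside the Kafka JVM and serves /metrics on the given port. Because the agent reads MBeans in-process, it does not depend on the remote JMX port (9999); that port is only needed by external JMX clients such as jmxterm.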
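To make the argument layout concrete, here is a minimal sketch that splits an agent argument into its three parts. The function name parse_javaagent is hypothetical, and the sketch assumes the simple jar=port:config form used in this section, without an optional host prefix before the port.

```shell
# Split a JMX Exporter agent argument into jar, port, and config path.
# Assumes the form -javaagent:<jar>=<port>:<config> with no host prefix.
# The body runs in a subshell so the helper variables stay local.
parse_javaagent() (
  spec=${1#-javaagent:}   # drop the JVM option prefix
  jar=${spec%%=*}         # text before the first '='
  rest=${spec#*=}         # remaining "<port>:<config>"
  port=${rest%%:*}        # text before the first ':'
  config=${rest#*:}       # everything after the first ':'
  printf 'jar=%s\nport=%s\nconfig=%s\n' "$jar" "$port" "$config"
)
```

Calling it with the value from kafka.env prints the jar path, port=7071, and config=/opt/jmx_exporter/kafka.yml on separate lines, which is a quick way to confirm that the port in KAFKA_OPTS matches the Prometheus scrape target.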
Prometheus scrape configuration
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kafka-broker'
    static_configs:
      - targets: ['10.0.0.11:7071','10.0.0.12:7071','10.0.0.13:7071']

systemctl restart prometheus
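Key Grafana dashboard panels (example metrics)
- Throughput: rate(kafka_broker_topic_bytes_total[5m])
- ISR health: kafka_under_replicated_partitions
- Request latency: needs Summary/Histogram-style metrics, for example extra rules that export the Mean or 99thPercentile attributes of TotalTimeMs (kafka_request_total only counts requests)
- JVM heap usage: jvm_heap_used_bytes / jvm_heap_max_bytes (requires a rule that also exports the max attribute of HeapMemoryUsage)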
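As a rough illustration of what the throughput panel plots, the sketch below computes a per-second rate from two counter samples. It is a simplification: the real PromQL rate() also handles counter resets and extrapolates to the window boundaries. The function name simple_rate and the sample values are hypothetical.

```shell
# Per-second increase between two counter samples.
# Args: first_value first_timestamp last_value last_timestamp (seconds).
# Simplified: ignores counter resets and window extrapolation.
simple_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    if (t2 <= t1) { print "NaN"; exit 1 }   # need a positive time span
    printf "%.2f\n", (v2 - v1) / (t2 - t1)
  }'
}
```

For example, simple_rate 1000000 0 4000000 300 prints 10000.00, meaning the byte counter grew by about 10 KB per second over a 5-minute window.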
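Alert example (Prometheus rule)
# /etc/prometheus/rules/kafka.yml
# Note: this file is loaded only if it is listed under rule_files in prometheus.yml.
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_under_replicated_partitions > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka ISR anomaly"
          description: "Under-replicated partitions have been present for more than 1 minute"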
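The effect of the for clause can be sketched with a small simulation: an alert becomes pending on the first evaluation where the expression is true and fires only once it has stayed true for the full duration. The function alert_state is a hypothetical helper, not part of Prometheus, and it assumes integer sample values and a fixed evaluation interval.

```shell
# Simulate the pending -> firing transition of a rule with a "for" duration.
# Args: for_seconds eval_interval_seconds value [value ...]
# Each value is the expression result at one evaluation (integer, > 0 means true).
# The body runs in a subshell so the helper variables stay local.
alert_state() (
  for_s=$1; step=$2; shift 2
  state=inactive; active=0
  for v in "$@"; do
    if [ "$v" -gt 0 ]; then
      if [ "$state" = inactive ]; then
        state=pending; active=0          # first true evaluation starts the clock
      else
        active=$((active + step))        # time elapsed since it became active
      fi
      if [ "$active" -ge "$for_s" ]; then state=firing; fi
    else
      state=inactive; active=0           # any false evaluation resets the alert
    fi
  done
  echo "$state"
)
```

With a 15s evaluation interval and for set to 60 seconds, alert_state 60 15 1 1 1 1 prints pending and alert_state 60 15 1 1 1 1 1 prints firing, so a brief ISR shrink that recovers within a minute does not page anyone.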
Troubleshooting checklist and commands (common issues)
1) The exporter returns no metrics:
curl -s http://localhost:7071/metrics | wc -l
# If the count is 0, check whether the Kafka process was started with -javaagent
ps -ef | grep kafka | grep javaagent
2) The JMX port is unreachable:
ss -lntp | grep 9999
# If nothing is listening, check whether the systemd environment variables took effect
systemctl show kafka | grep JMX
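3) Metric names do not match:
# Print the MBean list (requires jmxterm)
echo "beans" | java -jar jmxterm-1.0.2-uber.jar -l localhost:9999 -n
# -n runs jmxterm non-interactively and reads commands from stdin; use the listed MBean names to adjust the rule patterns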
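The first and third checks can be combined into one helper that reports which expected metric names are absent from the exporter output. This is a minimal sketch: missing_metrics is a hypothetical name, and it only looks for sample lines that start with the exact metric name.

```shell
# Print each expected metric name that has no sample line in the input.
# stdin: Prometheus text exposition (e.g. the output of /metrics)
# args:  metric names that should be present
# The body runs in a subshell so the helper variables stay local.
missing_metrics() (
  body=$(cat)
  for name in "$@"; do
    # A sample line starts with the name, followed by '{', a space, or end of line.
    if ! printf '%s\n' "$body" | grep -Eq "^${name}([{ ]|\$)"; then
      printf '%s\n' "$name"
    fi
  done
)
```

A possible invocation is curl -s http://localhost:7071/metrics | missing_metrics kafka_under_replicated_partitions jvm_heap_used_bytes; empty output means every listed metric was exported, while any printed name points at a rule pattern that needs adjusting.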
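Exercises
1) Add UnderReplicatedPartitions to a Grafana panel and configure a threshold alert.
2) Create a topic with a large number of partitions using kafka-topics.sh and observe how the throughput metrics change.
3) Simulate network jitter (for example with tc qdisc) and observe the RequestMetrics latency curves.
4) Change scrape_interval to 30s and compare the resulting Prometheus data volume.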
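For the last exercise, a back-of-the-envelope estimate helps set expectations before measuring: the number of samples ingested per day is roughly the number of active series times the number of scrapes per day. The helper samples_per_day and the figure of 3000 series are hypothetical.

```shell
# Estimate samples ingested per day.
# Args: active_series scrape_interval_seconds
samples_per_day() {
  echo $(( $1 * 86400 / $2 ))   # 86400 seconds in a day
}
```

Under these assumptions, samples_per_day 3000 15 prints 17280000 and samples_per_day 3000 30 prints 8640000, so doubling the interval should roughly halve the ingested sample count.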