10.7.2 Monitoring Tools and Visualization (JMX/Exporter/Prometheus)

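Kafka monitoring should cover the Broker, Controller, Producer, and Consumer, as well as the JVM and the OS. The recommended setup uses JMX as the data source, exposes it as Prometheus metrics through an Exporter, and uses Grafana for visualization and alerting. This gives end-to-end observability of the pipeline and supports capacity trend analysis.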

Data flow sketch

flowchart LR
  A[Kafka Broker JVM] -->|JMX| B[JMX Exporter]
  B -->|HTTP /metrics| C[Prometheus]
  C --> D[Grafana]
  C --> E[Alertmanager]
  E --> F[Notification / On-call]

Enabling and verifying JMX (example)
1) Enable JMX in the Kafka startup script or in systemd:

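# /etc/systemd/system/kafka.service.d/jmx.conf
# Note: authentication and SSL are disabled in this example; restrict access to port 9999 at the network level.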
[Service]
Environment="JMX_PORT=9999"
Environment="KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote \
 -Dcom.sun.management.jmxremote.authenticate=false \
 -Dcom.sun.management.jmxremote.ssl=false"

2) Reload and restart:

systemctl daemon-reload
systemctl restart kafka

3) Verify the port:

ss -lntp | grep 9999
# Expected: LISTEN 0 50 *:9999
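A plain `grep 9999` also matches ports such as 19999 or a PID that happens to contain 9999. The sketch below is one way to make the check stricter. The function name `port_is_listening` is hypothetical, and it assumes the usual `ss -lnt` column layout in which the fourth column is the local address and port.

```shell
#!/bin/sh
# Hypothetical helper: reads `ss -lnt` output on stdin and succeeds (exit 0)
# only if a LISTEN socket is bound to exactly the given port.
port_is_listening() {
  # Column 4 is "Local Address:Port"; the port is the part after the last colon.
  awk -v p="$1" '
    $1 == "LISTEN" { n = split($4, a, ":"); if (a[n] == p) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage on a broker host:
#   ss -lnt | port_is_listening 9999 && echo "JMX port is listening"
```

Splitting on the last colon also handles IPv6 listeners such as `[::]:9999`.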

Installing and configuring JMX Exporter (Java Agent)

# 1) Install the agent and write its rule file
mkdir -p /opt/jmx_exporter
cd /opt/jmx_exporter
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
cat > /opt/jmx_exporter/kafka.yml <<'EOF'
startDelaySeconds: 0
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Broker-wide byte counters (all topics aggregated)
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>Count'
    name: kafka_broker_topic_bytes_total
    type: COUNTER
    labels:
      direction: "$1"
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_under_replicated_partitions
    type: GAUGE
  # Count attribute of the TotalTimeMs histogram = number of requests handled
  - pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(Produce|FetchConsumer)><>Count'
    name: kafka_request_total
    type: COUNTER
    labels:
      request: "$1"
  # Exports jvm_heap_used_bytes and jvm_heap_max_bytes
  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(used|max)'
    name: jvm_heap_$1_bytes
    type: GAUGE
EOF

# 2) Configure the Kafka startup options (example: /etc/kafka/kafka.env)
cat > /etc/kafka/kafka.env <<'EOF'
KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/jmx_exporter/kafka.yml"
EOF

# 3) Reference the file from systemd
cat > /etc/systemd/system/kafka.service.d/opts.conf <<'EOF'
[Service]
EnvironmentFile=/etc/kafka/kafka.env
EOF

systemctl daemon-reload
systemctl restart kafka

# 4) Verify the Exporter
curl -s http://localhost:7071/metrics | head
# Expected: kafka_* and jvm_* metrics in the output

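Command note: -javaagent:<jar>=<port>:<config> starts the Exporter as a Java Agent inside the Kafka JVM and serves /metrics on the given port. Because the agent reads MBeans in-process, it does not depend on the remote JMX port (9999).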
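When verifying the Exporter, `head` or `wc -l` also counts the `# HELP` and `# TYPE` comment lines, so a non-zero line count does not prove that any Kafka metric was exported. The following is a minimal sketch of a stricter check. The function name `count_samples` is hypothetical and it assumes the standard Prometheus text exposition format.

```shell
#!/bin/sh
# Hypothetical helper: reads Prometheus text exposition from stdin and prints
# how many sample lines have a metric name starting with the given prefix.
# Comment lines (# HELP, # TYPE) and blank lines are ignored.
count_samples() {
  awk -v p="$1" '
    !/^#/ && NF > 0 && index($0, p) == 1 { n++ }
    END { print n + 0 }
  '
}

# Usage on a broker host:
#   curl -s http://localhost:7071/metrics | count_samples kafka_
```

A result of 0 for the `kafka_` prefix while `jvm_` is non-zero usually points at the rule patterns rather than at the agent itself.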

Prometheus scrape configuration

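# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml   # loads alert rules such as /etc/prometheus/rules/kafka.yml
scrape_configs:
  - job_name: 'kafka-broker'
    static_configs:
      - targets: ['10.0.0.11:7071','10.0.0.12:7071','10.0.0.13:7071']

# Optionally validate first (if promtool is installed), then restart to apply
promtool check config /etc/prometheus/prometheus.yml
systemctl restart prometheus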
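For clusters with more than a few brokers, typing the `targets` list by hand invites typos. As a small illustration, the hypothetical helper `scrape_targets` below builds the list from a port and a set of hosts; the output format simply mirrors the `targets` line in the configuration above.

```shell
#!/bin/sh
# Hypothetical helper: scrape_targets PORT HOST...
# Prints a YAML flow sequence suitable for a static_configs "targets" entry.
scrape_targets() {
  port="$1"; shift
  out=""
  for host in "$@"; do
    # Append a comma only when the list already has an entry.
    out="${out}${out:+,}'${host}:${port}'"
  done
  printf '[%s]\n' "$out"
}

# Usage:
#   scrape_targets 7071 10.0.0.11 10.0.0.12 10.0.0.13
```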

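Key Grafana dashboard panels (example metrics)
- Throughput: rate(kafka_broker_topic_bytes_total[5m])
- ISR: kafka_under_replicated_partitions
- Request latency (requires exporting Summary/Histogram-style metrics)
- JVM heap usage: jvm_heap_used_bytes / jvm_heap_max_bytes (both must be exported by the Exporter rules)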
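The throughput panel relies on `rate()` over a counter. To make the arithmetic concrete, here is a simplified sketch that computes a per-second rate from two samples of a counter. The function name `counter_rate` is hypothetical. Unlike the real PromQL `rate()`, it uses only two points and does not extrapolate to the window edges; it does treat a decrease as a counter reset, as happens after a broker restart.

```shell
#!/bin/sh
# Hypothetical helper: counter_rate V1 T1 V2 T2
# Per-second increase of a counter between sample (V1 at time T1) and
# sample (V2 at time T2); timestamps are in seconds.
counter_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    if (t2 <= t1) exit 1                      # samples must be in time order
    delta = (v2 >= v1) ? v2 - v1 : v2         # a drop means the counter restarted from 0
    printf "%.2f\n", delta / (t2 - t1)
  }'
}

# Usage: bytes counter went from 1000 to 31000 over a 5-minute window
#   counter_rate 1000 0 31000 300
```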

Alert example (Prometheus rule)

# /etc/prometheus/rules/kafka.yml
groups:
- name: kafka
  rules:
  - alert: KafkaUnderReplicatedPartitions
    expr: kafka_under_replicated_partitions > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Kafka ISR anomaly"
      description: "Under-replicated partitions have been present for more than 1 minute"
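The `for: 1m` clause means the alert does not fire on the first evaluation where the expression is true. It first becomes pending and fires only once the expression has stayed true for the whole duration. The sketch below imitates that state machine on a list of evaluation results. The function name `alert_states` and the input format (`timestamp value` per line) are assumptions made for illustration, and the evaluation timestamps in the usage example are made up.

```shell
#!/bin/sh
# Hypothetical helper: alert_states FOR_SECONDS
# stdin: one "unix_ts value" pair per rule evaluation, in time order.
# stdout: "unix_ts state" with state in {inactive, pending, firing},
# using the condition "value > 0" as in the rule above.
alert_states() {
  awk -v hold="$1" '{
    if ($2 > 0) {
      if (!active) { since = $1; active = 1 }   # condition just became true
      state = ($1 - since >= hold) ? "firing" : "pending"
    } else {
      active = 0; state = "inactive"            # any false evaluation resets the timer
    }
    print $1, state
  }'
}

# Usage: evaluations at 0s, 15s, 30s, 60s, 75s, 90s with for: 1m
#   printf '0 0\n15 2\n30 2\n60 2\n75 2\n90 0\n' | alert_states 60
```

In this example the condition first holds at 15s, so the alert is pending until 75s, when 60 seconds have elapsed, and returns to inactive as soon as the value drops back to 0.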

Troubleshooting tips and commands (common problems)
1) Exporter returns no metrics:

curl -s http://localhost:7071/metrics | wc -l
# If the result is 0, check whether the Kafka process was started with -javaagent
ps -ef | grep kafka | grep javaagent

2) JMX port unreachable:

ss -lntp | grep 9999
# If nothing is listening, check whether the systemd environment variables took effect
systemctl show kafka | grep JMX

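3) Metric names do not match:

# Print the MBean list (requires jmxterm); -n is non-interactive mode, so the command is piped in
echo beans | java -jar jmxterm-1.0.2-uber.jar -l localhost:9999 -n
# Use the listed MBean names to adjust the rule patterns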
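Rule patterns are matched against a string of the form `domain<key=value, key=value><>attribute`, not against the raw ObjectName, which is a frequent cause of mismatches. The hypothetical helper `objectname_to_rule_input` below sketches the conversion for simple names. It assumes no quoted values or regex metacharacters in the name. Pass the keys in the order the MBean was registered (Kafka uses `type=...,name=...`), because some tools may print them alphabetically.

```shell
#!/bin/sh
# Hypothetical helper: objectname_to_rule_input OBJECT_NAME ATTRIBUTE
# Converts "domain:k1=v1,k2=v2" plus an attribute name into the
# "domain<k1=v1, k2=v2><>Attribute" form that exporter rule patterns match.
objectname_to_rule_input() {
  printf '%s\n' "$1" | sed -e 's/:/</' -e 's/,/, /g' -e "s/\$/><>$2/"
}

# Usage:
#   objectname_to_rule_input 'kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions' Value
```

Comparing this output with the `pattern:` line in kafka.yml makes missing spaces after commas or a wrong attribute name easy to spot.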

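Exercises
1) Add UnderReplicatedPartitions to a Grafana panel and configure a threshold alert.
2) Create a topic with a large number of partitions using kafka-topics.sh and observe how the throughput metrics change.
3) Simulate network jitter (for example with tc qdisc) and observe the RequestMetrics latency curve.
4) Change scrape_interval to 30s and compare the change in Prometheus data volume.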
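As a starting point for exercise 4, the expected sample volume can be estimated before measuring it. The sketch below assumes every series is scraped once per interval; the function name `samples_per_day` and the series count in the usage example are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: samples_per_day SERIES INTERVAL_SECONDS
# Estimates how many samples Prometheus ingests per day for a fixed
# number of time series at a given scrape interval.
samples_per_day() {
  echo $(( $1 * 86400 / $2 ))
}

# Usage: 3000 series at 15s versus 30s
#   samples_per_day 3000 15
#   samples_per_day 3000 30
```

Doubling the interval halves the ingested samples; actual disk usage will differ because of compression.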