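9.8.1 Monitoring Metrics and Collection Methods

This section covers the metric system and collection methods for operating and monitoring Nacos. The goal is a closed monitoring loop that is observable, alertable, and traceable, so that the registry stays stable and available.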

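I. Designing the Metric System

Availability metrics: service registration/discovery success rate, heartbeat renewal success rate, API availability, request error rate (4xx/5xx).
Performance metrics: registration/discovery API latency, configuration publish latency, push latency, long-polling duration, connection setup time.
Capacity metrics: number of services, instances, configuration entries, subscribers, client connections, and cluster nodes.
Resource metrics: CPU, memory, disk, I/O, network traffic, thread count, GC count and pause time.
Stability metrics: leader-election state changes, health check failures, missed node heartbeats, retries.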
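The availability metrics above are ratios derived from counters. As a minimal sketch of the arithmetic, assuming the total and failed counts have already been read from some source, the helper below (the name `success_rate` is hypothetical) turns them into a success percentage:

```shell
#!/bin/sh
# success_rate TOTAL FAILED: print the success percentage with two decimals.
# Prints "n/a" when TOTAL is zero, since no rate can be derived.
success_rate() {
  awk -v t="$1" -v f="$2" 'BEGIN {
    if (t <= 0) { print "n/a"; exit }
    printf "%.2f\n", (t - f) * 100 / t
  }'
}

success_rate 2000 5    # 5 failed registrations out of 2000 attempts
```

The same formula applies to heartbeat renewal and API availability; only the source of the two counts changes.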

[Figure: schematic of the monitoring pipeline]

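II. Key Metrics Checklist

Registration and discovery: nacos_naming_register_count, nacos_naming_subscribe_count, nacos_naming_push_latency
Configuration center: nacos_config_publish_count, nacos_config_listener_count, nacos_config_notify_latency
System resources: jvm_memory_used, jvm_gc_pause_seconds, process_cpu_usage, process_open_fds
Service governance: nacos_cluster_leader, nacos_cluster_health, nacos_raft_term (cluster mode)

Metric names vary across Nacos versions, so confirm the exact names against the endpoint output before using them in dashboards or alert rules.

Example: quickly check whether metrics are exposed

# Expected: metrics returned in Prometheus text format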
curl -s http://127.0.0.1:8848/nacos/actuator/prometheus | head -n 20
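To narrow the output to a single metric family, the text can be filtered by name prefix. The following is a minimal sketch: the function name `metric_lines` is hypothetical, and the sample values are made up for illustration rather than captured from a real Nacos node.

```shell
#!/bin/sh
# metric_lines PREFIX: read Prometheus text format on stdin and print only
# sample lines whose metric name starts with PREFIX (comment lines are skipped).
metric_lines() {
  awk -v p="$1" '!/^#/ && index($0, p) == 1 { print }'
}

# Illustrative sample in Prometheus text format (values are made up).
sample='# HELP process_cpu_usage The recent cpu usage for the JVM process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.12
jvm_threads_live_threads 85
jvm_threads_daemon_threads 60'

printf '%s\n' "$sample" | metric_lines jvm_threads
```

Against a live node the same function can be fed from the endpoint, for example `curl -s http://127.0.0.1:8848/nacos/actuator/prometheus | metric_lines jvm_`.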

III. Collection Methods and Implementation

1. Prometheus metric collection

Installation example (binary release):

# Download and extract
cd /opt
wget https://github.com/prometheus/prometheus/releases/download/v2.49.1/prometheus-2.49.1.linux-amd64.tar.gz
tar -xf prometheus-2.49.1.linux-amd64.tar.gz
ln -s prometheus-2.49.1.linux-amd64 prometheus

Configuration example, /opt/prometheus/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'nacos'
    metrics_path: '/nacos/actuator/prometheus'
    static_configs:
      - targets:
          - '10.0.0.11:8848'
          - '10.0.0.12:8848'
        labels:
          cluster: 'nacos-prod'
          env: 'prod'

Start and verify:

/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml \
  --web.listen-address="0.0.0.0:9090" &
# Expected: http://<ip>:9090/targets is reachable and shows the nacos targets as UP
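The same check can be scripted against the Prometheus HTTP API endpoint /api/v1/targets instead of the web page. The sketch below assumes the response is compact JSON (no spaces around the colon) and avoids a JSON parser by matching the `health` field textually; the function name `count_health` and the sample response are illustrative, not real output.

```shell
#!/bin/sh
# count_health STATE: read a /api/v1/targets response on stdin and print how
# many targets report the given health state ("up", "down" or "unknown").
count_health() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Abbreviated, hand-written response used only for illustration.
sample='{"status":"success","data":{"activeTargets":[
{"labels":{"instance":"10.0.0.11:8848","job":"nacos"},"health":"up"},
{"labels":{"instance":"10.0.0.12:8848","job":"nacos"},"health":"down"}]}}'

echo "up=$(printf '%s' "$sample" | count_health up) down=$(printf '%s' "$sample" | count_health down)"
```

On a running server the input would come from `curl -s http://127.0.0.1:9090/api/v1/targets`. A textual match is fragile by design; a proper JSON tool is preferable wherever one is available.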

2. Log collection and analysis

Example: shipping Nacos logs to Elasticsearch with Filebeat
/etc/filebeat/filebeat.yml:

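filebeat.inputs:
  - type: log
    enabled: true
    paths:
      # Adjust to the log files actually present under the Nacos logs directory
      - /home/nacos/logs/nacos.log
      - /home/nacos/logs/config.log
      - /home/nacos/logs/naming.log
      - /home/nacos/logs/raft.log
    fields:
      app: nacos
      env: prod

# A custom index name needs an explicit template name and pattern, and
# index lifecycle management (ILM) must be off or the index setting is ignored.
setup.ilm.enabled: false
setup.template.name: "nacos-logs"
setup.template.pattern: "nacos-logs-*"

output.elasticsearch:
  hosts: ["http://10.0.0.20:9200"]
  index: "nacos-logs-%{+yyyy.MM.dd}"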

Start:

systemctl enable filebeat
systemctl start filebeat
# Expected: nacos-logs-* indices appear in Elasticsearch

3. System and host-level monitoring (Node Exporter)

Installation example:

cd /opt
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xf node_exporter-1.7.0.linux-amd64.tar.gz
/opt/node_exporter-1.7.0.linux-amd64/node_exporter &
# Expected: port 9100 is reachable
curl -s http://127.0.0.1:9100/metrics | head -n 5

Additional Prometheus scrape job (append under scrape_configs):

  - job_name: 'node'
    static_configs:
      - targets: ['10.0.0.11:9100','10.0.0.12:9100']
        labels:
          cluster: 'nacos-prod'

4. API availability probing

Example: a health probe script (suitable for cron or a probing system):

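#!/usr/bin/env bash
# Probe the instance-list API; exit 2 (critical) on any non-200 response.
URL="http://127.0.0.1:8848/nacos/v1/ns/instance/list?serviceName=demo"
# --max-time keeps a hung node from blocking the probe; curl reports 000 when no response arrives.
HTTP_CODE=$(curl -s --max-time 5 -o /tmp/nacos_check.json -w "%{http_code}" "$URL")

if [ "$HTTP_CODE" -ne 200 ]; then
  echo "CRITICAL: nacos api failed http_code=$HTTP_CODE"
  exit 2
fi

# Expected: a JSON instance list with http_code=200
echo "OK: nacos api ok"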
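If the probe feeds a monitoring system that distinguishes warning from critical, the status code can be mapped to more than two outcomes. The sketch below is one possible mapping, not part of Nacos: the function name `classify_http_code` and the choice of which codes count as warning or critical are assumptions, following the common plugin convention of exit codes 0, 1 and 2.

```shell
#!/bin/sh
# classify_http_code CODE: print a status line and return a plugin-style
# exit code (0 = OK, 1 = WARNING, 2 = CRITICAL). The mapping is an assumption.
classify_http_code() {
  case "$1" in
    200) echo "OK: http_code=$1"; return 0 ;;
    000) echo "CRITICAL: no HTTP response (connection failed or timed out)"; return 2 ;;
    5??) echo "CRITICAL: server error http_code=$1"; return 2 ;;
    *)   echo "WARNING: unexpected http_code=$1"; return 1 ;;
  esac
}

classify_http_code 200
classify_http_code 503 || echo "exit=$?"
```

Separating "no response" from "server error" helps during triage: the former points at the network or a dead process, the latter at a node that is running but unhealthy.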

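IV. Alerting Strategy and Suggested Thresholds

High priority: node unavailable, sudden rise in registration/discovery failure rate, leader election failure, CPU > 85%, memory > 90%.
Medium priority: high registration/discovery latency, rising configuration publish failure rate, rising share of long-polling timeouts.
Low priority: steadily growing connection count, abnormal thread count, frequent GC that does not yet affect response times.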
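The resource thresholds in the high-priority tier can be expressed as a simple comparison. This is a minimal sketch under the assumption that CPU and memory usage are supplied as ratios between 0 and 1; the function name `resource_priority` and the "normal" label are hypothetical.

```shell
#!/bin/sh
# resource_priority CPU MEM: both arguments are ratios between 0 and 1.
# Prints "high" when CPU > 0.85 or memory > 0.90 (the high-priority
# thresholds listed above), otherwise "normal".
resource_priority() {
  awk -v c="$1" -v m="$2" 'BEGIN {
    if (c > 0.85 || m > 0.90) print "high"; else print "normal"
  }'
}

resource_priority 0.91 0.40
resource_priority 0.50 0.60
```

In production the comparison normally lives in alert rules with a hold duration, so that short spikes do not page anyone; the function only illustrates the threshold logic.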

Prometheus alert rule example (load the file through rule_files in prometheus.yml):

groups:
  - name: nacos-alerts
    rules:
      - alert: NacosNodeDown
        expr: up{job="nacos"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nacos node unavailable"
          description: "Node {{ $labels.instance }} has been unavailable for more than 1 minute"

      - alert: NacosHighCPU
        expr: process_cpu_usage{job="nacos"} > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nacos CPU usage too high"

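V. Rollout Recommendations

Standardize labels: use a common set such as cluster, node, and env so that data can be aggregated and compared.
Layer the dashboards: cluster overview → node details → API-level metrics → log tracing.
Combine fault drills with monitoring: use load tests, network partitions, and node shutdowns to verify that alerts actually fire.
Link to business metrics: correlate key business APIs with Nacos metrics to avoid the case where the metrics look healthy but the business is failing.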

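VI. Common Troubleshooting Steps

Metrics have no data
- Check that Nacos exposes the Actuator endpoint (management.endpoints.web.exposure.include in application.properties) and that the path is reachable.
- Check that the Prometheus targets are UP.
- Check firewall and security group rules for ports 8848 and 9100.

# Check that the port is listening and that the endpoint returns metrics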
ss -lntp | grep 8848
curl -s http://127.0.0.1:8848/nacos/actuator/prometheus | wc -l

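Logs are not collected
- Check that Filebeat has read permission on the log files.
- Verify that the paths are correct and that the logs are receiving new entries.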

systemctl status filebeat
tail -n 20 /var/log/filebeat/filebeat
ls -l /home/nacos/logs/

API probe fails
- Verify that the Nacos service is alive and that the service name exists.

curl -s "http://127.0.0.1:8848/nacos/v1/ns/service/list?pageNo=1&pageSize=10"

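VII. Exercises and Hands-on Practice

Exercise 1: metric collection
- Set up Prometheus and add the Nacos and Node Exporter targets.
- Take a screenshot of the /targets page and mark the UP state.
Exercise 2: triggering an alert
- Manually stop one Nacos node and check whether the NacosNodeDown alert fires.
Exercise 3: log search
- In Elasticsearch, search for records from nacos.log that contain ERROR and export 10 of them.
Exercise 4: API probing
- Use a script to probe /nacos/v1/ns/instance/list and print an alert message when the response is not 200.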