17.1.4 Service Discovery and Scrape Mechanism

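Service discovery and scraping are the core capabilities that let Prometheus monitor large, dynamic environments. The flow is: service discovery produces a target list → target relabeling (relabel) filters and rewrites it → the final scrape targets are formed → Prometheus pulls /metrics at each scrape interval → the samples are written to the TSDB.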
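
To make the "form the final scrape targets" step more concrete, here is a minimal sketch of how a relabeled label set could turn into a scrape URL. It is an illustration, not Prometheus source code, and the helper name finalize_target is hypothetical. It relies on documented behaviour: the URL is built from the __scheme__, __address__, and __metrics_path__ labels, instance defaults to __address__ when relabeling leaves it unset, and labels whose names start with __ are removed afterwards.

```python
from typing import Dict, Tuple


def finalize_target(labels: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Model the last step before scraping (hypothetical helper).

    Takes the label set left after relabel_configs and returns
    (scrape_url, final_labels).
    """
    scheme = labels.get("__scheme__", "http")
    address = labels["__address__"]
    path = labels.get("__metrics_path__", "/metrics")
    url = f"{scheme}://{address}{path}"

    # Labels starting with "__" are internal and do not reach the stored series.
    final = {k: v for k, v in labels.items() if not k.startswith("__")}
    # If relabeling did not set "instance", it defaults to the target address.
    final.setdefault("instance", address)
    return url, final


if __name__ == "__main__":
    discovered = {
        "__address__": "192.168.10.10:9100",
        "job": "node_static",
        "env": "prod",
    }
    url, final = finalize_target(discovered)
    print(url)
    print(final)
```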

I. Installation and base environment (examples use the Linux binary release)

# 1) Download and unpack
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.49.1/prometheus-2.49.1.linux-amd64.tar.gz
tar xf prometheus-2.49.1.linux-amd64.tar.gz
sudo mv prometheus-2.49.1.linux-amd64 /opt/prometheus

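# 2) Create the configuration and data directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp /opt/prometheus/prometheus.yml /etc/prometheus/
# The foreground example below runs as the current user, which needs write access to the data directory
sudo chown "$(id -un)" /var/lib/prometheus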

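# 3) Start Prometheus (foreground example)
/opt/prometheus/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle

Flag notes:
- --config.file sets the main configuration file;
- --storage.tsdb.path sets the TSDB data directory;
- --web.listen-address sets the address and port for the web UI and HTTP API;
- --web.enable-lifecycle enables the /-/reload and /-/quit HTTP endpoints, which are disabled by default; the reload step later in this section depends on it.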

II. Scrape and service discovery configuration examples (full workflow)
1) Static service discovery (suits fixed hosts)

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s

scrape_configs:
  - job_name: "node_static"
    static_configs:
      - targets: ["192.168.10.10:9100", "192.168.10.11:9100"]
        labels:
          env: "prod"
          role: "node"

2) File-based service discovery (targets update dynamically, no restart needed)

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: "node_file_sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/node_targets.json
        refresh_interval: 30s

Contents of /etc/prometheus/targets/node_targets.json:
[
  {
    "targets": ["192.168.10.12:9100", "192.168.10.13:9100"],
    "labels": { "env": "staging", "role": "node" }
  }
]
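
When an external system generates this file, writing it atomically keeps Prometheus from reading a half-written file. The following is a minimal sketch of that idea, assuming a POSIX filesystem; the helper name write_targets is hypothetical and not part of any Prometheus tooling.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_targets(path: str, targets: List[str], labels: Dict[str, str]) -> None:
    """Write a file_sd target file atomically (temp file, then rename)."""
    groups = [{"targets": targets, "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(groups, f, indent=2)
            f.write("\n")
        # mkstemp creates the file with mode 0600; make it readable for the
        # account that runs Prometheus (assumed to differ from the writer).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A call such as write_targets("/etc/prometheus/targets/node_targets.json", ["192.168.10.12:9100", "192.168.10.13:9100"], {"env": "staging", "role": "node"}) would reproduce the file shown above.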

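Changes to the target file do not need a reload: Prometheus watches the files listed under file_sd_configs and also re-reads them every refresh_interval. A reload is needed only after editing prometheus.yml itself:

# Trigger a configuration reload (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

Notes:
- /-/reload makes Prometheus re-read its configuration file; the endpoint is available only when Prometheus is started with --web.enable-lifecycle, and sending SIGHUP to the process is an alternative;
- file_sd_configs lets external systems manage targets dynamically by writing JSON (or YAML) files.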
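
For automation, the same reload call can be issued from a script. This is a minimal sketch using only the Python standard library; it assumes Prometheus listens on localhost:9090 and was started with --web.enable-lifecycle (without that flag the endpoint is disabled). The function names are hypothetical.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build the POST request for the /-/reload lifecycle endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )


def reload_config(base_url: str = "http://localhost:9090", timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code.

    urllib raises urllib.error.HTTPError on 4xx/5xx responses, for example
    when the lifecycle API is disabled or the new configuration is invalid.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```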

3) Target relabeling and metric relabeling (reducing high-cardinality risk)

scrape_configs:
  - job_name: "app"
    static_configs:
      - targets: ["10.0.1.21:8080"]
        labels:
          app: "order"
          env: "prod"

    relabel_configs:
      # Rewrite instance to the service name plus port
      - source_labels: [__address__]
        target_label: instance
        replacement: "order-service:8080"

      # Keep only targets in the prod environment
      - source_labels: [env]
        regex: "prod"
        action: keep

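    metric_relabel_configs:
      # Drop the high-cardinality user_id label from every scraped sample.
      # labeldrop matches its regex against label names and does not take source_labels.
      - regex: "user_id"
        action: labeldrop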
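
The semantics of these actions are easy to get wrong, so the sketch below models a small subset of them (replace, keep, drop, labeldrop) in plain Python. It is an approximation for experimenting with rules, not the Prometheus implementation: apply_relabel is a hypothetical helper, and Python's re module stands in for the RE2 syntax that Prometheus uses. It does mirror three documented points: source label values are joined with ";" by default, the regex must match the whole value, and labeldrop matches label names.

```python
import re
from typing import Dict, List, Optional


def apply_relabel(labels: Dict[str, str], rules: List[dict]) -> Optional[Dict[str, str]]:
    """Apply a subset of relabel actions; None means the target or sample is dropped."""
    labels = dict(labels)
    for rule in rules:
        action = rule.get("action", "replace")
        regex = re.compile(rule.get("regex", "(.*)"))
        if action == "labeldrop":
            # labeldrop tests the regex against label NAMES, not values.
            labels = {k: v for k, v in labels.items() if not regex.fullmatch(k)}
            continue
        # Values of source_labels are concatenated with the separator.
        value = rule.get("separator", ";").join(
            labels.get(name, "") for name in rule.get("source_labels", [])
        )
        match = regex.fullmatch(value)  # the regex is anchored at both ends
        if action == "keep":
            if match is None:
                return None
        elif action == "drop":
            if match is not None:
                return None
        elif action == "replace":
            if match is None:
                continue
            # Translate Go-style $1 / ${1} references into Python's \g<1>.
            template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", rule.get("replacement", "$1"))
            new_value = match.expand(template)
            if new_value:
                labels[rule["target_label"]] = new_value
            else:
                labels.pop(rule["target_label"], None)  # empty value removes the label
        else:
            raise ValueError(f"unsupported action: {action}")
    return labels


if __name__ == "__main__":
    target = {"__address__": "10.0.1.21:8080", "app": "order", "env": "prod"}
    rules = [
        {"source_labels": ["__address__"], "target_label": "instance",
         "replacement": "order-service:8080"},
        {"source_labels": ["env"], "regex": "prod", "action": "keep"},
    ]
    print(apply_relabel(target, rules))
    print(apply_relabel({**target, "env": "dev"}, rules))
```

Running a candidate rule set through a model like this before deploying it makes it easier to see which targets a keep rule would remove.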

4) Scrape parameters and security (authentication/TLS example)

scrape_configs:
  - job_name: "secure_app"
    scheme: https
    metrics_path: /metrics
    scrape_interval: 30s
    scrape_timeout: 10s
    static_configs:
      - targets: ["secure-app.local:9443"]
    basic_auth:
      username: "prom"
      password: "prom-pass"
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      insecure_skip_verify: false
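
To check such an endpoint independently of Prometheus, a script can send the same kind of request. The sketch below uses only the Python standard library and assumes the URL, credentials, and CA file from the configuration above; the function names are hypothetical.

```python
import base64
import ssl
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def fetch_metrics(url: str, username: str, password: str, ca_file: str,
                  timeout: float = 10.0) -> str:
    """Fetch a metrics page over HTTPS, verifying the server against ca_file."""
    context = ssl.create_default_context(cafile=ca_file)
    request = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(username, password)}
    )
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For the job above, the call would be fetch_metrics("https://secure-app.local:9443/metrics", "prom", "prom-pass", "/etc/prometheus/certs/ca.crt").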

III. Verification and troubleshooting
1) Check target status

# Web UI: http://localhost:9090/targets
# The same information is available through the API
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health, lastError: .lastError}'

Notes:
- api/v1/targets returns each target's health, last error, and last scrape time.
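
Where jq is not available, the same filtering can be done in a few lines of Python. This sketch assumes the response has already been parsed from JSON and follows the shape used by the jq filter above (data.activeTargets with labels, health, and lastError); unhealthy_targets is a hypothetical helper.

```python
from typing import Any, Dict, List


def unhealthy_targets(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return job, instance, health, and lastError for every target that is not up."""
    result = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") == "up":
            continue
        labels = target.get("labels", {})
        result.append({
            "job": labels.get("job", ""),
            "instance": labels.get("instance", ""),
            "health": target.get("health", "unknown"),
            "lastError": target.get("lastError", ""),
        })
    return result


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {"activeTargets": [
            {"labels": {"job": "node_static", "instance": "192.168.10.10:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "node_static", "instance": "192.168.10.11:9100"},
             "health": "down", "lastError": "connection refused"},
        ]},
    }
    for row in unhealthy_targets(sample):
        print(row)
```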

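2) Common problems and fixes
- Scrape timeout:
  Fix: increase scrape_timeout (it must not exceed scrape_interval), or investigate why the target's /metrics endpoint responds slowly.
- Target unreachable:
  Fix: check firewall rules and ports, and confirm that the exporter is running.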

# Check that the port is reachable
nc -zv 192.168.10.10 9100

# Query the metrics endpoint directly
curl -s http://192.168.10.10:9100/metrics | head

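- Relabeling drops targets by mistake:
  Fix: before adding an action: keep rule to relabel_configs, use action: replace to copy the value being matched into a visible label (one whose name does not start with __) and inspect it on /targets, or temporarily remove the rule to confirm that it is the cause. The /service-discovery page of the web UI also shows each target's labels before and after relabeling, including dropped targets.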
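
The targets API can help here as well: besides activeTargets, the response of /api/v1/targets contains a droppedTargets list whose entries carry discoveredLabels, the labels before relabeling. The sketch below filters that list by job; it assumes an already parsed response, and dropped_targets is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def dropped_targets(payload: Dict[str, Any], job: Optional[str] = None) -> List[Dict[str, str]]:
    """Return the pre-relabel labels of dropped targets, optionally for one job."""
    dropped = payload.get("data", {}).get("droppedTargets", [])
    labels = [target.get("discoveredLabels", {}) for target in dropped]
    if job is not None:
        labels = [item for item in labels if item.get("job") == job]
    return labels


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [],
            "droppedTargets": [
                {"discoveredLabels": {"__address__": "10.0.1.22:8080",
                                      "job": "app", "env": "dev"}},
            ],
        },
    }
    for item in dropped_targets(sample, job="app"):
        print(item)
```

Comparing the discovered labels of a dropped target with the regex of the keep rule usually shows why the rule did not match.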

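IV. Scrape strategy recommendations (brief)
- Critical components (databases, gateways) can use a 15s interval; low-frequency business metrics can use 60s or longer.
- Keep scrape load under control so that requests to a large number of targets do not cause network spikes. Prometheus already spreads the scrapes of different targets across the interval, so the main levers are the interval itself and the number of targets per Prometheus instance.
- Prefer the pull model; use Pushgateway only as a supplement for short-lived jobs.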

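V. Exercises
1) Add two nodes through file_sd_configs and confirm on /targets that they appear without a restart; use /-/reload only if you also changed prometheus.yml.
2) Add a relabel rule to job_name: app that drops targets with env=dev.
3) Simulate a scrape failure: point the target at an unreachable port and read the lastError shown on /targets. A refused connection fails immediately; to see an actual timeout, use a target that responds more slowly than scrape_timeout.
4) Add Basic Auth to the secure scrape job, and use curl -u to verify that the target endpoint is reachable.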