17.4.7 Metric Naming, Labels, and Scrape Interval Strategy

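Metric naming, label design, and scrape interval largely determine how usable a Prometheus deployment is and how much it costs to run. This section covers naming and labeling conventions, the reasoning and practice behind choosing scrape intervals, and then installation checks, troubleshooting, and exercises.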

1. Concept sketch: how naming, labels, and scrape interval affect cost and performance

2. Metric naming conventions and examples

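Naming rules
- Use snake_case; avoid camelCase and spaces.
- Counters end in _total. Histograms expose _bucket, _sum, and _count series; summaries expose _sum and _count plus series carrying a quantile label.
- Put the unit in the metric name, for example _seconds or _bytes, so it does not have to be repeated in labels or help text.
- Avoid dynamic names that embed an ID or username; every distinct name creates new series, which leads to cardinality explosion.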
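
The naming rules above can be turned into a small lint check. The following is a minimal sketch, not an official Prometheus tool: the function name `check_metric_name` and the "four or more consecutive digits looks like an ID" heuristic are assumptions made for illustration.

```python
import re

# snake_case: lowercase words and digits separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_metric_name(name: str, metric_type: str) -> list[str]:
    """Return the naming-rule violations for a proposed metric name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end with _total")
    if metric_type != "counter" and name.endswith("_total"):
        problems.append("_total suffix is reserved for counters")
    # Heuristic (assumption): a run of four or more digits is likely an embedded ID.
    if re.search(r"\d{4,}", name):
        problems.append("looks like it embeds an ID")
    return problems


for candidate, kind in [
    ("http_requests_total", "counter"),
    ("httpRequests", "counter"),
    ("orders_user_123456_total", "counter"),
]:
    print(candidate, check_metric_name(candidate, kind))
```

A check like this fits naturally into code review or CI, so that badly named metrics are caught before they are first scraped and become hard to rename.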

Example: exposing custom metrics (text format)
File path: /var/lib/node_exporter/textfile_collector/app_metrics.prom

# HELP http_requests_total Total http requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status_code="200"} 1024

# HELP request_latency_seconds Request latency
# TYPE request_latency_seconds histogram
request_latency_seconds_bucket{le="0.1"} 120
request_latency_seconds_bucket{le="0.5"} 300
request_latency_seconds_bucket{le="1"} 450
request_latency_seconds_bucket{le="+Inf"} 500
request_latency_seconds_sum 120.5
request_latency_seconds_count 500

Verification command

# Confirm that node_exporter exposes the new metrics
curl -s http://localhost:9100/metrics | grep -E 'http_requests_total|request_latency_seconds'
# Expected: the metrics added above appear in the output
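
Grepping for the metric names only proves the series exist; it does not show that the histogram is internally consistent. The sketch below checks two invariants of the text format: bucket counts must be cumulative, and the le="+Inf" bucket must equal _count. It assumes a single label set per histogram (only the le label), and the helper name `check_histogram` is hypothetical.

```python
import re

BUCKET = re.compile(r'^(\w+)_bucket\{.*?le="([^"]+)".*?\}\s+(\S+)$')
COUNT = re.compile(r"^(\w+)_count(?:\{.*\})?\s+(\S+)$")

# The histogram from the example file above.
SAMPLE = """\
# TYPE request_latency_seconds histogram
request_latency_seconds_bucket{le="0.1"} 120
request_latency_seconds_bucket{le="0.5"} 300
request_latency_seconds_bucket{le="1"} 450
request_latency_seconds_bucket{le="+Inf"} 500
request_latency_seconds_sum 120.5
request_latency_seconds_count 500
"""


def check_histogram(text: str, name: str) -> list[str]:
    """Check cumulative buckets and the +Inf/_count invariant for one histogram."""
    buckets = []  # (upper bound, cumulative count)
    count = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = BUCKET.match(line)
        if m and m.group(1) == name:
            buckets.append((float(m.group(2)), float(m.group(3))))
            continue
        m = COUNT.match(line)
        if m and m.group(1) == name:
            count = float(m.group(2))

    problems = []
    buckets.sort()  # order by upper bound; +Inf sorts last
    has_inf = bool(buckets) and buckets[-1][0] == float("inf")
    if not has_inf:
        problems.append('missing le="+Inf" bucket')
    counts = [c for _, c in buckets]
    if any(a > b for a, b in zip(counts, counts[1:])):
        problems.append("bucket counts are not cumulative")
    if has_inf and count is not None and buckets[-1][1] != count:
        problems.append("+Inf bucket does not equal _count")
    return problems


print(check_histogram(SAMPLE, "request_latency_seconds"))
```

Running a check of this kind before the file is moved into the textfile directory catches mistakes that would otherwise surface only as odd quantile results later.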

3. Label design principles and examples

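Principles
- Use labels to distinguish dimensions you will filter or aggregate by; do not turn every attribute into a label.
- Common base labels: instance, job, cluster, env, region.
- High-cardinality risk: do not report labels such as request_id or user_id.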

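Example: reasonable vs. unreasonable labels

# Reasonable: low-cardinality dimensions
http_requests_total{method="GET",status_code="200",env="prod"} 1024

# Unreasonable: a high-cardinality dimension causes a series explosion
http_requests_total{request_id="ab123-xyz-9999"} 1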

Checking for high-cardinality labels (PromQL)

# Number of time series for one metric
count({__name__="http_requests_total"})
# Number of distinct values of the request_id label
count(count by (request_id) (http_requests_total))
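
To see why a single label such as request_id is so costly, it helps to estimate the worst case before deploying. The sketch below computes an upper bound on the series count of one metric as the product of the distinct values per label. The cardinalities used here are hypothetical numbers chosen for illustration, not measurements.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for http_requests_total.
low = {"method": 5, "status_code": 10, "env": 3}
high = {**low, "request_id": 100_000}

print(max_series(low))   # 150
print(max_series(high))  # 15000000
```

The real count is usually below this bound because not every combination occurs, but the multiplication shows how one unbounded label dominates everything else.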

4. Scrape interval strategy and configuration example

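Strategy
- A global scrape_interval of 15s suits most systems (if the setting is omitted, Prometheus falls back to 1m).
- Scrape fast-changing metrics such as latency every 5s to 10s, and slow-changing metrics such as capacity or version every 30s to 5m.
- Keep the scrape interval and the rule evaluation interval equal, or make one a multiple of the other.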
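
The cost side of this strategy is simple arithmetic: each series produces one sample per scrape, so halving the interval doubles ingestion. The following sketch estimates samples per day and checks the "equal or multiple" rule. Both function names and the series count of 1000 are assumptions for illustration.

```python
def samples_per_day(series: int, scrape_interval_s: int) -> int:
    """Samples ingested per day for a given series count and scrape interval."""
    return series * 86_400 // scrape_interval_s


def intervals_aligned(scrape_interval_s: int, evaluation_interval_s: int) -> bool:
    """True when one interval is an integer multiple of the other."""
    longer = max(scrape_interval_s, evaluation_interval_s)
    shorter = min(scrape_interval_s, evaluation_interval_s)
    return longer % shorter == 0


# Hypothetical job with 1000 series at three different intervals.
for interval in (5, 15, 120):
    print(interval, samples_per_day(1000, interval))

print(intervals_aligned(5, 15))   # True
print(intervals_aligned(10, 15))  # False
```

For 1000 series this gives 17,280,000 samples per day at 5s, 5,760,000 at 15s, and 720,000 at 2m, which is why short intervals are best reserved for the few jobs that need them.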

Prometheus configuration example
File path: /etc/prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "node"
    scrape_interval: 15s
    static_configs:
      - targets: ["10.0.0.11:9100","10.0.0.12:9100"]

  - job_name: "app_latency"
    scrape_interval: 5s
    static_configs:
      - targets: ["10.0.0.21:8080"]

  - job_name: "cmdb_info"
    scrape_interval: 2m
    static_configs:
      - targets: ["10.0.0.31:9091"]

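Commands

# Check the configuration syntax
promtool check config /etc/prometheus/prometheus.yml

# Reload the configuration without restarting (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload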

5. Installation and verification (with command explanations)

Install Node Exporter and enable textfile_collector

# Download and extract
cd /opt
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
ln -s /opt/node_exporter-1.7.0.linux-amd64 /opt/node_exporter

# Create the textfile directory
mkdir -p /var/lib/node_exporter/textfile_collector

# Start (with textfile_collector)
/opt/node_exporter/node_exporter \
  --collector.textfile \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --web.listen-address=":9100"

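Flag explanations
- --collector.textfile: enables the textfile collector.
- --collector.textfile.directory: directory from which custom *.prom metric files are read.
- --web.listen-address: address and port to listen on.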

6. Troubleshooting common problems

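1) Metrics do not appear

# Check that the metrics file is readable by the node_exporter user
ls -l /var/lib/node_exporter/textfile_collector

# Check for format errors (common: missing or inconsistent HELP/TYPE, incomplete lines)
curl -s http://localhost:9100/metrics | tail -n 20
# node_textfile_scrape_error is 1 when the collector hit an error reading a file
curl -s http://localhost:9100/metrics | grep node_textfile_scrape_error

Troubleshooting: insufficient file permissions, malformed text format, a file name that does not end in .prom, or node_exporter running without textfile_collector enabled.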

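2) Prometheus scrapes fail

# Prometheus target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'
# Expected: every target reports "up"

Troubleshooting: no network route, a firewall blocking the port, a wrong port, or a target that responds more slowly than scrape_timeout (which cannot exceed scrape_interval, so a very short interval also forces a short timeout).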
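
The jq one-liner prints only the health values, which makes it hard to tell which target is failing and why. The sketch below parses a targets response and lists each unhealthy target with its job, scrape URL, and last error. The `SAMPLE_RESPONSE` is hand-written illustrative data shaped like the /api/v1/targets output, not a captured log, and `unhealthy_targets` is a hypothetical helper name.

```python
import json

# Hand-written sample shaped like a /api/v1/targets response; values are illustrative.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "node", "instance": "10.0.0.11:9100"},
       "scrapeUrl": "http://10.0.0.11:9100/metrics",
       "health": "up", "lastError": ""},
      {"labels": {"job": "app_latency", "instance": "10.0.0.21:8080"},
       "scrapeUrl": "http://10.0.0.21:8080/metrics",
       "health": "down", "lastError": "context deadline exceeded"}
    ]
  }
}
"""


def unhealthy_targets(response_text: str) -> list[tuple[str, str, str]]:
    """Return (job, scrape URL, last error) for every target that is not up."""
    payload = json.loads(response_text)
    rows = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            rows.append(
                (target["labels"]["job"], target["scrapeUrl"], target.get("lastError", ""))
            )
    return rows


for job, url, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(f"{job}\t{url}\t{error}")
```

Against a live server, the same function could be fed the body returned by the curl command above. A deadline-style error in the last column points at timeouts, while connection errors point at network or port problems.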

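3) High cardinality causes a memory spike

# Find the metrics with the most time series
topk(5, count by (__name__)({__name__!=""}))

Troubleshooting: drop high-cardinality labels, lower the scrape frequency, or aggregate before reporting.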
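
"Aggregate before reporting" means summing away the offending label inside the application, so only the low-cardinality series are ever exposed. The following is a minimal sketch of that idea on in-memory samples. The `drop_label` helper and the sample data are hypothetical and not part of any Prometheus client library.

```python
from collections import defaultdict


def drop_label(samples, label):
    """Sum sample values after removing one label from every label set."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A sorted tuple of the remaining label pairs serves as a hashable series key.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[kept] += value
    return dict(totals)


# Hypothetical per-request samples carrying a request_id label.
raw = [
    ({"method": "GET", "request_id": "a"}, 1),
    ({"method": "GET", "request_id": "b"}, 1),
    ({"method": "POST", "request_id": "c"}, 1),
]

for key, value in drop_label(raw, "request_id").items():
    print(dict(key), value)
```

Three series collapse into two here; with real traffic the reduction is from one series per request to one per method.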

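7. Exercises

1) Design three conforming names for business metrics (including unit and type) and give a reasonable label set for each.
2) Change the app_latency scrape interval to 10s and verify that Prometheus loads the new configuration.
3) Use PromQL to break down http_requests_total by status_code.
4) Deliberately add a request_id label, observe how the number of time series changes, and record the result.

# Exercise 3: per-second request rate by status code
sum by (status_code) (rate(http_requests_total[1m]))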