17.7.1 Alerting Rule Design Principles and Layering

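Alerting rules should be actionable, low-noise, aggregatable, and maintainable. Every alert must point to a specific fault or risk and give the responder something to act on. Rules should tolerate momentary fluctuations, which cuts false positives and duplicate alerts. They should support aggregation by dimensions such as service, environment, region, and cluster. Rule names, labels, and annotations should follow one convention so that alerts are easy to search, route, and trace.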

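A common design uses three layers: infrastructure, application services, and business metrics.
- Infrastructure layer: health of hosts, networks, storage, containers, and nodes
- Application service layer: availability and performance of MySQL, Redis, Kafka, Nginx, and similar services
- Business metrics layer: SLAs and key business indicators (success rate, latency, throughput)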

[Figure: schematic of alert layers and their routing targets]

Example: rule naming and label conventions

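# /etc/prometheus/rules/alerting.yml
groups:
- name: infra.node
  rules:
  - alert: NodeCPUHigh
    # CPU usage = 1 - idle fraction, averaged over all cores of an instance.
    # Averaging the non-idle modes directly would divide by the number of modes
    # and understate the real usage.
    expr: 1 - avg by (instance, cluster) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
    for: 10m
    labels:
      severity: P2
      env: prod
      team: platform
      layer: infra
    annotations:
      summary: "Node CPU under sustained high load"
      description: "5-minute average CPU usage on instance {{ $labels.instance }} has stayed above 85% for 10 minutes"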

Layered alert examples (threshold + duration + trend)

1) Infrastructure layer (node unreachable)

- alert: NodeDown
  expr: up{job="node"} == 0
  for: 2m
  labels:
    severity: P1
    layer: infra
    team: platform
  annotations:
    summary: "Node unreachable"
    description: "Scrapes of instance {{ $labels.instance }} have been failing for more than 2 minutes"

2) Application service layer (MySQL connections nearing exhaustion)

- alert: MySQLConnectionsNearLimit
  expr: (mysql_global_status_threads_connected / mysql_global_variables_max_connections) > 0.85
  for: 5m
  labels:
    severity: P2
    layer: app
    team: db
  annotations:
    summary: "MySQL connections close to the limit"
    description: "Connection usage has stayed above 85% for 5 minutes"

3) Business metrics layer (API timeout rate)

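- alert: ApiTimeoutRateHigh
  # Share of requests answered within 2s; le="2" must match a bucket boundary
  # that the service actually exports.
  expr: (sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
        / sum(rate(http_request_duration_seconds_count[5m]))) < 0.95
  for: 10m
  labels:
    severity: P1
    layer: biz
    team: service
  annotations:
    summary: "API timeout rate is elevated"
    description: "Fewer than 95% of requests completed within 2 seconds for 10 minutes"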
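
The three examples above combine a threshold with a duration; none of them looks at a trend. One way to express a trend is PromQL's predict_linear, which extrapolates a gauge by linear regression. The following is a minimal sketch, assuming node_exporter filesystem metrics are being scraped. The alert name DiskWillFillIn4h, the 6h lookback, the 4h horizon, and the fstype filter are illustrative choices, not part of the original rule set.

```yaml
# Hypothetical trend-based rule (name, windows, and filter are assumptions)
- alert: DiskWillFillIn4h
  # Fit a line to the last 6h of free space and extrapolate 4h ahead;
  # a negative prediction means the filesystem is on course to fill up.
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: P2
    layer: infra
    team: platform
  annotations:
    summary: "Disk predicted to fill within 4 hours"
    description: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is projected to run out of space within 4 hours"
```

A trend rule like this tends to fire earlier than a fixed threshold on fast-growing volumes and stays quiet on volumes that are full but stable, so it usually complements a threshold rule rather than replacing it.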

Example: recording rules to reduce noise (pre-aggregation)

# /etc/prometheus/rules/recording.yml
groups:
- name: infra.record
  rules:
  - record: job:node_cpu_usage:avg5m
    # CPU usage = 1 - idle fraction, averaged over all cores of an instance
    expr: 1 - avg by (job, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

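The alerting rule can then reference the recorded series instead of the raw expression, which makes evaluation cheaper and keeps the definition consistent across rules. This form is meant to replace the expr of the earlier NodeCPUHigh rule rather than run alongside it: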

- alert: NodeCPUHigh
  expr: job:node_cpu_usage:avg5m > 0.85
  for: 10m
  labels:
    severity: P2
    layer: infra

Installing and loading rules (Prometheus)

# 1) Put the rule files in place
sudo mkdir -p /etc/prometheus/rules
sudo cp alerting.yml /etc/prometheus/rules/alerting.yml
sudo cp recording.yml /etc/prometheus/rules/recording.yml

# 2) Reference the rule files from the Prometheus configuration
#    (note: this overwrites any existing prometheus.yml)
sudo tee /etc/prometheus/prometheus.yml >/dev/null <<'EOF'
global:
  scrape_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.0.10:9100']
EOF

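# 3) Reload the configuration without restarting
#    (the endpoint is only served when Prometheus runs with --web.enable-lifecycle)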
curl -X POST http://127.0.0.1:9090/-/reload

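What each part does:
- rule_files: loads the alerting and recording rule files
- /-/reload: hot-reloads the Prometheus configuration; requires the --web.enable-lifecycle flag
- scrape_configs: defines the scrape targets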

Verifying that the rules took effect

# List the rule groups currently loaded
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].name'

# Validate an expression in the Prometheus UI
# Open http://127.0.0.1:9090/graph and enter the expr

Troubleshooting checklist (common problems)
1) A rule does not take effect
- Check the syntax:

promtool check rules /etc/prometheus/rules/alerting.yml

- Check that the configuration was loaded:

curl -s http://127.0.0.1:9090/api/v1/status/config | grep -n rule_files

2) Too much alert noise
- Lengthen the for duration
- Pre-aggregate with recording rules
- Aggregate by labels:

expr: sum by (cluster, service) (rate(http_requests_total{code=~"5.."}[5m])) > 5

3) Alerts are not routed
- Check the required labels: env, severity, team, layer
- Confirm that the Alertmanager matching rules cover them
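
To make the routing and noise items in this checklist concrete, here is a minimal Alertmanager configuration sketch that matches on the severity, team, and layer labels used by the rules in this section. It assumes Alertmanager 0.22 or later (for the matchers syntax). The receiver names default, oncall-pager, and db-team are hypothetical and carry no notification integrations, and the grouping and timing values are illustrative.

```yaml
# alertmanager.yml (sketch; receiver names and timings are assumptions)
route:
  receiver: default                 # fallback when no child route matches
  group_by: ['alertname', 'cluster', 'layer']
  group_wait: 30s                   # wait briefly so related alerts arrive in one notification
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="P1"
      receiver: oncall-pager
    - matchers:
        - team="db"
      receiver: db-team

inhibit_rules:
  # While NodeDown fires for an instance, mute P2 infra alerts from the same instance
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - layer="infra"
      - severity="P2"
    equal: ['instance']

receivers:
  - name: default
  - name: oncall-pager
  - name: db-team
```

An alert that lacks the label a route matches on falls through to the default receiver, which is one common reason alerts appear to be "not routed". If amtool is installed, `amtool check-config alertmanager.yml` can validate the file before it is loaded.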

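Exercises and verification
1) Design an infrastructure alert for "disk almost full"
- Requirements: 80% usage threshold, sustained for 15m, aggregated by instance
2) Design an application alert for "Nginx 5xx rate elevated"
- Requirements: 5xx ratio > 1%, sustained for 10m
3) Validate the rule file with promtool and hot-reload it
4) Trigger an alert on purpose: temporarily lower the threshold, or use vector(1) as the expr to simulate a firing condition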

Reference answer (Nginx 5xx)

# Assumes the exporter exposes nginx_http_requests_total with a status label
- alert: Nginx5xxRateHigh
  expr: (sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
        / sum(rate(nginx_http_requests_total[5m]))) > 0.01
  for: 10m
  labels:
    severity: P2
    layer: app
    team: web
  annotations:
    summary: "Nginx 5xx ratio is elevated"
    description: "5xx ratio has stayed above 1% for 10 minutes"
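
For exercise 1, one possible answer is sketched below. It assumes node_exporter's node_filesystem_avail_bytes and node_filesystem_size_bytes metrics. The alert name DiskUsageHigh, the fstype filter, and the severity and team labels are illustrative assumptions rather than a prescribed solution.

```yaml
# Hypothetical answer for exercise 1 (name, filter, and labels are assumptions)
- alert: DiskUsageHigh
  # Used fraction = 1 - available / size, computed per filesystem;
  # max by (instance) keeps the fullest filesystem of each instance.
  expr: max by (instance) (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.80
  for: 15m
  labels:
    severity: P2
    layer: infra
    team: platform
  annotations:
    summary: "Disk usage high"
    description: "At least one filesystem on {{ $labels.instance }} has been above 80% used for 15 minutes"
```

Aggregating with max by (instance) yields one alert per host instead of one per mount point, at the cost of dropping the mountpoint label from the notification.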

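Together, layering, consistent naming, label aggregation, and recording rules reduce noise and improve maintainability, and they close the loop with alert routing and escalation.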