17.7.1 Alert Rule Design Principles and Layering
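Alert rules should be actionable, low-noise, aggregatable, and maintainable. Each alert must point to a specific fault or risk that someone can act on; rules should tolerate transient fluctuations so that false positives and duplicate alerts stay rare; alerts should aggregate along dimensions such as service, environment, region, and cluster; and rule names, labels, and annotations should follow one convention so that alerts are easy to search, route, and trace.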
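Rules are commonly organized into three layers: infrastructure, application services, and business metrics.
- Infrastructure layer: health of hosts, networks, storage, containers, and nodes
- Application service layer: availability and performance of MySQL, Redis, Kafka, Nginx, and similar services
- Business metrics layer: SLAs and key business indicators (success rate, latency, throughput)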
Schematic: alert layers and routing targets
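Example of rule naming and label conventions
# /etc/prometheus/rules/alerting.yml
groups:
  - name: infra.node
    rules:
      - alert: NodeCPUHigh
        # CPU utilization = 1 - idle fraction, averaged over all cores of an instance.
        # Averaging the non-idle modes directly would yield a per-mode average, not utilization.
        expr: 1 - avg by (instance, cluster) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
        for: 10m
        labels:
          severity: P2
          env: prod
          team: platform
          layer: infra
        annotations:
          summary: "Node CPU load is persistently high"
          description: "Instance {{ $labels.instance }}: 5-minute average CPU utilization above 85% for 10 minutes"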
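Layered alert examples (threshold + duration + trend)
1) Infrastructure layer (node unreachable)
- alert: NodeDown
  expr: up{job="node"} == 0
  for: 2m
  labels:
    severity: P1
    layer: infra
    team: platform
  annotations:
    summary: "Node unreachable"
    description: "Scrape of instance {{ $labels.instance }} has failed for more than 2 minutes"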
2) Application service layer (MySQL connections approaching the limit)
- alert: MySQLConnectionsNearLimit
  expr: (mysql_global_status_threads_connected / mysql_global_variables_max_connections) > 0.85
  for: 5m
  labels:
    severity: P2
    layer: app
    team: db
  annotations:
    summary: "MySQL connections are close to the limit"
    description: "Connection usage above 85% for 5 minutes"
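3) Business metrics layer (API timeout rate)
- alert: ApiTimeoutRateHigh
  # Fraction of requests answered within 2 seconds. le="2" must match a bucket
  # boundary that the histogram actually exposes; the exact label value
  # (for example "2" or "2.0") depends on the client library.
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
      /
      sum(rate(http_request_duration_seconds_count[5m]))
    ) < 0.95
  for: 10m
  labels:
    severity: P1
    layer: biz
    team: service
  annotations:
    summary: "API timeout rate is elevated"
    description: "Share of requests answered within 2 seconds below 95% for 10 minutes"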
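The three rules above are threshold-plus-duration checks; a trend-based rule extrapolates a metric instead of waiting for it to cross a limit. The following is a minimal sketch using the built-in PromQL function `predict_linear`. It assumes the same mysqld_exporter metric names as the MySQL example and that both series carry identical labels; the alert name, the 30-minute window, and the 1-hour horizon are illustrative choices, not values from the original text.

```yaml
# Sketch only: alert name, window (30m) and horizon (3600s) are illustrative assumptions.
- alert: MySQLConnectionsExhaustionPredicted
  # Fit a linear trend to the last 30 minutes of the gauge and extrapolate
  # 3600 seconds ahead; the condition holds when the projection exceeds max_connections.
  expr: |
    predict_linear(mysql_global_status_threads_connected[30m], 3600)
      > mysql_global_variables_max_connections
  for: 10m
  labels:
    severity: P2
    layer: app
    team: db
  annotations:
    summary: "MySQL connections projected to reach the limit within 1 hour"
    description: "Instance {{ $labels.instance }}: linear projection of connected threads exceeds max_connections"
```

A longer window smooths out short bursts at the cost of reacting later, so the window and the `for` duration are usually tuned together.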
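Example of reducing noise with recording rules (pre-aggregation)
# /etc/prometheus/rules/recording.yml
groups:
  - name: infra.record
    rules:
      # Naming follows level:metric:operations; the level prefix names the
      # aggregation level of the output, here one series per instance.
      - record: instance:node_cpu_usage:avg5m
        expr: 1 - avg by (job, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
The alerting rule then references the recorded series, which lowers query cost and keeps the definition of CPU usage consistent across rules and dashboards:
- alert: NodeCPUHigh
  expr: instance:node_cpu_usage:avg5m > 0.85
  for: 10m
  labels:
    severity: P2
    layer: infra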
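Installing and loading rules (Prometheus)
# 1) Place the rule files
sudo mkdir -p /etc/prometheus/rules
sudo cp alerting.yml /etc/prometheus/rules/alerting.yml
sudo cp recording.yml /etc/prometheus/rules/recording.yml
# 2) Reference the rule files from the Prometheus configuration
#    Note: tee overwrites the whole file; on an existing installation,
#    add only the rule_files entry to the current configuration instead.
sudo tee /etc/prometheus/prometheus.yml >/dev/null <<'EOF'
global:
  scrape_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.0.10:9100']
EOF
# 3) Validate, then reload the configuration without restarting
#    (the reload endpoint works only if Prometheus was started with --web.enable-lifecycle)
promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://127.0.0.1:9090/-/reload
Notes on the commands:
- rule_files: loads alerting and recording rules
- /-/reload: hot-reloads the Prometheus configuration; sending SIGHUP to the Prometheus process has the same effect
- scrape_configs: defines the scrape targets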
Verifying that the rules are active
# List the rule groups currently loaded
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].name'
# Check the expression in the Prometheus UI:
# open http://127.0.0.1:9090/graph and enter the rule's expr
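Loading a rule does not prove that it fires when it should. `promtool test rules` can replay synthetic series against a rule file and compare the resulting alerts. The sketch below is an assumption-laden example: the file name `alerting_test.yml` is hypothetical, the NodeDown rule is assumed to live in `alerting.yml` in the same directory with the labels and annotations shown in this section, and the instance address is borrowed from the scrape configuration. The expected annotations must match the rendered annotations of your rule character for character, so copy them from your own rule file.

```yaml
# alerting_test.yml (hypothetical file name). Run with:
#   promtool test rules alerting_test.yml
rule_files:
  - alerting.yml   # assumed to contain the NodeDown rule from this section

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic series: up stays at 0 for the samples at 0m through 10m.
      - series: 'up{job="node", instance="10.0.0.10:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # 1m is still inside the 2m "for" window: the alert is pending, not firing.
      - eval_time: 1m
        alertname: NodeDown
        exp_alerts: []
      # By 5m the condition has held for longer than 2m, so the alert fires.
      - eval_time: 5m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: P1
              layer: infra
              team: platform
              job: node
              instance: "10.0.0.10:9100"
            exp_annotations:
              summary: "Node unreachable"
              description: "Scrape of instance 10.0.0.10:9100 has failed for more than 2 minutes"
```

Keeping such a test next to each rule file makes it possible to check threshold and `for` changes before they reach production.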
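Troubleshooting checklist (common problems)
1) Rules do not take effect
- Check the syntax:
promtool check rules /etc/prometheus/rules/alerting.yml
- Check that the configuration was loaded with the expected rule_files entry:
curl -s http://127.0.0.1:9090/api/v1/status/config | jq -r '.data.yaml' | grep -n -A2 'rule_files'
2) Too much alert noise
- Increase the for duration
- Pre-aggregate with recording rules
- Aggregate by label:
expr: sum by (cluster, service) (rate(http_requests_total{code=~"5.."}[5m])) > 5
3) Alerts are not routed
- Check the required labels: env, severity, team, layer
- Confirm that the Alertmanager matching rules select those labels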
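To make the routing check concrete, here is a minimal sketch of an Alertmanager configuration that routes on the `severity`, `layer`, and `team` labels used by the rules in this section. All receiver names are hypothetical placeholders with no notification integrations configured, and the timing values are illustrative; the `matchers` syntax assumes Alertmanager 0.22 or later.

```yaml
# alertmanager.yml (sketch): receiver names are hypothetical placeholders.
route:
  receiver: default
  # One notification per alert name / cluster / service instead of one per series.
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="P1"
      receiver: oncall-pager
      continue: true   # keep evaluating sibling routes so the owning team is notified too
    - matchers:
        - layer="infra"
        - team="platform"
      receiver: platform-team
    - matchers:
        - layer="app"
        - team="db"
      receiver: db-team
    - matchers:
        - layer="biz"
      receiver: service-team

inhibit_rules:
  # While a node is down, suppress P2 alerts that carry the same instance label.
  - source_matchers:
      - alertname="NodeDown"
    target_matchers:
      - severity="P2"
    equal: [instance]

receivers:
  - name: default
  - name: oncall-pager
  - name: platform-team
  - name: db-team
  - name: service-team
```

An alert that carries none of the matched labels falls through to the `default` receiver, which is one common reason an alert seems to be "not routed". The file can be validated with `amtool check-config alertmanager.yml`.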
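Exercises and verification
1) Design an infrastructure alert for "disk about to fill up"
- Requirements: 80% threshold, held for 15m, aggregated by instance
2) Design an application alert for "Nginx 5xx rate elevated"
- Requirements: 5xx ratio > 1% for 10m
3) Validate the rule files with promtool and hot-reload them
4) Trigger an alert on purpose: temporarily relax the threshold so the condition is met (for a > comparison, lower it), or simulate one with vector(1)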
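For exercise 4, a rule whose expression is `vector(1)` is always true, which makes it a convenient way to exercise the whole path from rule evaluation to notification. A minimal sketch follows; the group name, alert name, `P3` severity, and `env: test` value are hypothetical, and the rule should be removed once the test is done.

```yaml
# Temporary test rule (sketch): names and label values are hypothetical.
groups:
  - name: test.synthetic
    rules:
      - alert: SyntheticTestAlert
        # vector(1) returns a single sample with value 1, so the condition always holds.
        expr: vector(1)
        for: 0m
        labels:
          severity: P3
          env: test
          team: platform
          layer: infra
        annotations:
          summary: "Synthetic alert for testing the routing path"
          description: "Fires unconditionally; remove this rule after the test"
```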
Example answer (Nginx 5xx)
- alert: Nginx5xxRateHigh
  expr: |
    (
      sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(nginx_http_requests_total[5m]))
    ) > 0.01
  for: 10m
  labels:
    severity: P2
    layer: app
    team: web
  annotations:
    summary: "Nginx 5xx ratio is elevated"
    description: "5xx ratio above 1% for 10 minutes"
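Exercise 1 can be approached the same way. The sketch below is one possible answer, not the only one: it assumes the standard node_exporter filesystem metrics, excludes `tmpfs` and `overlay` filesystems as an illustrative choice, and takes the fullest filesystem per instance; the alert name is hypothetical.

```yaml
# One possible answer (sketch): alert name and fstype filter are assumptions.
- alert: NodeDiskUsageHigh
  # Used fraction per filesystem (1 - avail/size), then the fullest
  # filesystem on each instance.
  expr: |
    max by (instance) (
      1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
    ) > 0.80
  for: 15m
  labels:
    severity: P2
    layer: infra
    team: platform
  annotations:
    summary: "Disk usage is high"
    description: "Instance {{ $labels.instance }}: fullest filesystem at {{ $value | humanizePercentage }}, above 80% for 15 minutes"
```

Aggregating with `max by (instance)` satisfies the per-instance requirement but drops the mountpoint label; grouping by `(instance, mountpoint)` instead would identify the affected filesystem at the cost of more alerts.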
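Taken together, layering, consistent naming, label-based aggregation, and recording rules reduce noise and improve maintainability, and they work with alert routing and escalation to form a closed loop.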