17.7.7 Typical Alert Rule Examples and Verification
This section provides alert rule examples and verification methods for typical scenarios, covering node, container, middleware, and application metrics. Each category comes with a ready-to-deploy rule file, verification commands, troubleshooting points, and a hands-on drill, so the rules can be taken straight to production.
[Schematic: rule evaluation and alert flow]
1. Rule file example (with labels, annotations, and severity tiers)
Example file path: /etc/prometheus/rules/alerts.yml
groups:
  - name: node.rules
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 2m
        labels:
          severity: P1
          team: ops
          service: node
        annotations:
          summary: "Node down {{ $labels.instance }}"
          description: "Exporter unreachable for 2 minutes"
          runbook: "https://runbook.example.com/node-down"
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m]))*100) > 85
        for: 5m
        labels:
          severity: P2
          team: ops
          service: node
        annotations:
          summary: "High CPU usage {{ $labels.instance }}"
          description: "CPU usage > 85% for 5m"
      - alert: LowMemoryAvailable
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: P2
          team: ops
          service: node
        annotations:
          summary: "Low available memory {{ $labels.instance }}"
          description: "Available memory < 10% for 5m"
  - name: middleware.rules
    rules:
      - alert: MySQLConnectionsHigh
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: P2
          team: db
          service: mysql
        annotations:
          summary: "MySQL connections near limit {{ $labels.instance }}"
          description: "Connection usage > 80% for 5m"
      - alert: Nginx5xxSpike
        expr: sum by (instance)(rate(nginx_http_requests_total{status=~"5.."}[5m])) > 5
        for: 2m
        labels:
          severity: P2
          team: web
          service: nginx
        annotations:
          summary: "Nginx 5xx spike {{ $labels.instance }}"
          description: "5xx rate > 5 req/s for 2m"
  - name: k8s.rules
    rules:
      - alert: PodRestartTooMuch
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
        for: 5m
        labels:
          severity: P2
          team: sre
          service: k8s
        annotations:
          summary: "Pod restarting too often {{ $labels.pod }}"
          description: "More than 3 restarts within 10m"
      - alert: PodNotReady
        expr: kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: P1
          team: sre
          service: k8s
        annotations:
          summary: "Pod not ready {{ $labels.pod }}"
          description: "Pod Ready has been false for 5m"
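Before loading a file like this, the alerting logic can be unit-tested offline with `promtool test rules`. A minimal sketch for the NodeDown rule follows; the test file name `alerts_test.yml` and the input series labels are assumptions for illustration:

```yaml
# alerts_test.yml — run with: promtool test rules alerts_test.yml
rule_files:
  - /etc/prometheus/rules/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # up stays at 0 for 11 samples: the target is down the whole time
      - series: 'up{instance="10.0.0.1:9100", job="node"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m          # well past the 2m "for" duration
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: P1
              team: ops
              service: node
              instance: 10.0.0.1:9100
              job: node
```

An empty `exp_alerts` list would instead assert that the alert does not fire, which is useful for testing that recovery clears the condition.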
2. Installation and rule loading (example workflow)
# 1) Prepare the rules directory
sudo mkdir -p /etc/prometheus/rules
# 2) Copy alerts.yml into the directory
sudo cp ./alerts.yml /etc/prometheus/rules/alerts.yml
# 3) Confirm the main Prometheus config references the rules
sudo grep -n "rule_files" /etc/prometheus/prometheus.yml
Key configuration in /etc/prometheus/prometheus.yml:
rule_files:
  - /etc/prometheus/rules/*.yml
Reload the configuration and verify it took effect:
# Hot-reload over HTTP (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Check that Prometheus loaded the rule groups
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
3. PromQL verification and expected results
# Verify the expression matches, in the Prometheus UI or via the API
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up==0' | jq '.data.result | length'
Expected result: when a target is down, result is non-empty; if it is empty, either the metric has no data or the target is still up.
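The same check can be scripted against the JSON shape returned by /api/v1/query. A minimal parsing sketch — the embedded response below is a fabricated sample in the documented format, not real server output:

```python
import json

# Sample response in the shape Prometheus /api/v1/query returns for an
# instant vector (values fabricated for illustration).
sample = '''
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "instance": "10.0.0.1:9100", "job": "node"},
       "value": [1700000000, "0"]}
    ]
  }
}
'''

resp = json.loads(sample)
assert resp["status"] == "success"

# A non-empty result list means at least one target currently matches up == 0.
hits = resp["data"]["result"]
print(len(hits))
for series in hits:
    # value is [timestamp, "string-encoded sample value"]
    print(series["metric"]["instance"], series["value"][1])
```

In a real script the `sample` string would be replaced by the body of an HTTP GET against the query endpoint.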
4. Alertmanager routing verification (simple receiver example)
/etc/alertmanager/alertmanager.yml
route:
  receiver: "default"
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
receivers:
  - name: "default"
    webhook_configs:
      - url: "http://localhost:5001/alerts"
Start a local webhook receiver to verify delivery:
# Quick-and-dirty listener with netcat (it prints the request but never
# sends an HTTP response, so Alertmanager will log the push as failed and retry)
nc -lk 5001
Expected result: after an alert fires, the terminal shows the JSON that Alertmanager POSTs.
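Because netcat never answers, Alertmanager treats each push as a failure; for repeated drills, a tiny receiver that returns 200 behaves more faithfully. A minimal sketch using only the standard library — the port 5001 matches the route above, everything else is an assumption:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    """Accepts Alertmanager webhook POSTs and prints each alert."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Alertmanager wraps the batch in a top-level "alerts" array.
        for alert in payload.get("alerts", []):
            print(alert.get("status"), alert.get("labels", {}).get("alertname"))
        self.send_response(200)        # tell Alertmanager delivery succeeded
        self.end_headers()

    def log_message(self, *args):      # silence per-request access logs
        pass

# To run the receiver on the port configured above:
#   HTTPServer(("0.0.0.0", 5001), AlertHandler).serve_forever()
```

Each grouped notification arrives as one POST; the `status` field on each alert flips from "firing" to "resolved" when the condition clears.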
5. Quick reference: rules for typical scenarios
- Node down: up == 0, for: 2m
- High CPU: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m]))*100) > 85
- Low memory: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
- Low disk: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})*100 < 15
- High IO wait: avg by (instance)(irate(node_cpu_seconds_total{mode="iowait"}[5m]))*100 > 20
- MySQL slow queries: rate(mysql_global_status_slow_queries[5m]) > 1
- Redis memory: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 85 (note: redis_memory_max_bytes is 0 when maxmemory is unset)
- Kafka under-replicated partitions: kafka_server_replicamanager_underreplicatedpartitions > 0
- SSL cert expiring: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
- URL availability: probe_success == 0
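The arithmetic behind the SSL expiry rule is easy to sanity-check offline. A sketch mirroring the PromQL comparison, with fabricated timestamps:

```python
# Mirrors: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
SEVEN_DAYS = 86400 * 7   # 604800 seconds

def cert_expiring_soon(expiry_ts: float, now_ts: float) -> bool:
    """True when the certificate expires within seven days of now_ts."""
    return expiry_ts - now_ts < SEVEN_DAYS

now = 1_700_000_000
print(cert_expiring_soon(now + 3 * 86400, now))    # expires in 3 days  -> True
print(cert_expiring_soon(now + 30 * 86400, now))   # expires in 30 days -> False
```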
6. Troubleshooting checklist (commands and notes)
# 1) Rule syntax errors
promtool check rules /etc/prometheus/rules/alerts.yml
# 2) Rules not loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
# 3) Missing metrics or failed scrapes
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job:.labels.job,health:.health}'
# 4) Expression returns no data
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=your_expr' | jq
# 5) Alertmanager not receiving alerts
curl -s http://localhost:9093/api/v2/status | jq '.cluster'
Common causes:
- The rules directory is not referenced by rule_files
- Prometheus was neither hot-reloaded nor restarted
- The exporter is not running, or metric names have changed
- Alertmanager routing conditions do not match the alert's labels
7. Drill and verification steps (executable)
# 1) Trigger manually: stop an exporter
sudo systemctl stop node_exporter
# 2) Watch the alert appear
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | .labels.alertname'
# 3) Restore the service
sudo systemctl start node_exporter
# 4) Verify the alert resolves
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | .state'
Expected result: the alert transitions from firing to resolved (resolved alerts drop out of the active-alerts list), giving a complete record of the lifecycle.
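The transition observed in this drill follows Prometheus's alert lifecycle: an expression must hold for the full "for" duration before pending becomes firing, and one non-matching evaluation clears the alert. A minimal sketch of that state machine — a simplification for illustration, not Prometheus's actual implementation:

```python
def alert_state(breaches: list, for_steps: int) -> str:
    """Walk per-interval evaluation results (True = expression matched)
    and return the final state: 'inactive', 'pending', or 'firing'.
    for_steps is the 'for' duration expressed in evaluation intervals,
    e.g. 'for: 2m' with a 1m interval gives for_steps = 2."""
    consecutive = 0
    state = "inactive"
    for breached in breaches:
        if breached:
            consecutive += 1
            # The alert fires once the condition has been held for the
            # full duration, i.e. strictly more than for_steps matches.
            state = "firing" if consecutive > for_steps else "pending"
        else:
            consecutive = 0
            state = "inactive"   # Prometheus then emits a resolved notification
    return state

# Down for 3 consecutive 1m intervals with for: 2m -> firing
print(alert_state([True, True, True], 2))
# Recovers on the last interval -> inactive (the alert resolves)
print(alert_state([True, True, True, False], 2))
```

This also explains the occasional "flapping" alert: a condition that clears just before the "for" duration elapses resets the counter and never fires.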
8. Exercises (hands-on)
1) Write a disk inode usage alert with an 85% threshold held for 10m, and validate it with promtool.
2) Add a runbook link to the Nginx 5xx alert and verify that the Alertmanager webhook output contains the field.
3) Simulate Pod restarts, watch whether increase(kube_pod_container_status_restarts_total[10m]) matches, and record the alert timeline.
9. Deployment recommendations
- Every service class should carry at least four alert types: availability, capacity, performance, and error rate
- Separate "informational" alerts from "actionable" alerts to reduce overnight noise
- Use multi-metric joint conditions on critical paths to reduce false positives
- Define an owner, an SLA, and handling steps (runbook) for every alert