17.7.7 Alert Rule Examples and Verification for Typical Scenarios

This section provides alert rule examples and verification methods for typical scenarios, covering node, container, middleware, and application metrics. Each category includes a ready-to-deploy rule file, verification commands, troubleshooting notes, and drill exercises for production use.

Schematic: rule evaluation and alert flow (figure omitted)

1. Rule file example (labels, annotations, and severity tiers)
Example file path: /etc/prometheus/rules/alerts.yml

groups:
- name: node.rules
  rules:
  - alert: NodeDown
    expr: up == 0
    for: 2m
    labels:
      severity: P1
      team: ops
      service: node
    annotations:
      summary: "Node offline: {{ $labels.instance }}"
      description: "Exporter unreachable for 2 minutes"
      runbook: "https://runbook.example.com/node-down"

  - alert: HighCPUUsage
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m]))*100) > 85
    for: 5m
    labels:
      severity: P2
      team: ops
      service: node
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage > 85% for 5m"

  - alert: LowMemoryAvailable
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
    for: 5m
    labels:
      severity: P2
      team: ops
      service: node
    annotations:
      summary: "Low available memory on {{ $labels.instance }}"
      description: "Available memory < 10% for 5m"

- name: middleware.rules
  rules:
  - alert: MySQLConnectionsHigh
    expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
    for: 5m
    labels:
      severity: P2
      team: db
      service: mysql
    annotations:
      summary: "MySQL connections near limit on {{ $labels.instance }}"
      description: "Connection usage > 80% for 5m"

  - alert: Nginx5xxSpike
    expr: sum by (instance)(rate(nginx_http_requests_total{status=~"5.."}[5m])) > 5
    for: 2m
    labels:
      severity: P2
      team: web
      service: nginx
    annotations:
      summary: "Nginx 5xx spike on {{ $labels.instance }}"
      description: "5xx rate > 5/s for 2m"

- name: k8s.rules
  rules:
  - alert: PodRestartTooMuch
    expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
    for: 5m
    labels:
      severity: P2
      team: sre
      service: k8s
    annotations:
      summary: "Pod restarting too often: {{ $labels.pod }}"
      description: "More than 3 restarts within 10m"

  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="true"} == 0
    for: 5m
    labels:
      severity: P1
      team: sre
      service: k8s
    annotations:
      summary: "Pod not ready: {{ $labels.pod }}"
      description: "Pod Ready has been false for 5m"
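Several of the expressions above share the same idiom: derive a usage percentage from irate() over the idle-mode CPU counter. A minimal Python sketch of the HighCPUUsage arithmetic, using made-up per-CPU idle fractions rather than real node_exporter samples:

```python
def cpu_usage_percent(idle_rates):
    """Mirror of: 100 - avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100.

    idle_rates: per-CPU idle fractions (idle seconds per wall-clock second),
    one value per cpu label of a single instance, as irate() would return.
    """
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100

# Two CPUs at 10% and 18% idle -> ~86% average usage, crossing the > 85 threshold
usage = cpu_usage_percent([0.10, 0.18])
print(usage > 85)  # True
```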

2. Installation and rule loading (example workflow)

# 1) Prepare the rules directory
sudo mkdir -p /etc/prometheus/rules

# 2) Copy alerts.yml into the directory
sudo cp ./alerts.yml /etc/prometheus/rules/alerts.yml

# 3) Confirm the main Prometheus config references the rules
sudo grep -n "rule_files" /etc/prometheus/prometheus.yml

Key configuration in /etc/prometheus/prometheus.yml:

rule_files:
  - /etc/prometheus/rules/*.yml

Reload the configuration and verify it took effect:

# Hot reload over HTTP (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Check that Prometheus has loaded the rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
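The /api/v1/rules response nests rule groups under .data.groups. A small parser over a trimmed sample body shows what the jq filter above extracts (the payload shape follows the Prometheus HTTP API; the values are illustrative):

```python
import json

# Trimmed sample of a /api/v1/rules response body (illustrative values)
sample = json.loads("""
{"status":"success","data":{"groups":[
  {"name":"node.rules","file":"/etc/prometheus/rules/alerts.yml","rules":[
    {"name":"NodeDown","state":"firing","type":"alerting"}]},
  {"name":"middleware.rules","file":"/etc/prometheus/rules/alerts.yml","rules":[]}
]}}
""")

def loaded_group_names(resp):
    """Equivalent of: jq '.data.groups[].name'"""
    return [g["name"] for g in resp["data"]["groups"]]

print(loaded_group_names(sample))  # ['node.rules', 'middleware.rules']
```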

3. PromQL verification and expected results

# Verify that the expression matches, in the Prometheus UI or via the API
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up==0' | jq '.data.result | length'

Expected result: when a target is down, result is non-empty; an empty result means either the metric has no data or the target is still up.
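The query API wraps matches in .data.result. A sketch of the emptiness check against a trimmed sample response for `up == 0` (instance names and timestamps are made up):

```python
import json

# Trimmed /api/v1/query response for 'up == 0' (illustrative values)
resp = json.loads("""
{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"instance":"10.0.0.1:9100","job":"node"},"value":[1700000000,"0"]}
]}}
""")

def offline_instances(resp):
    """Instances currently matching up == 0; empty list means no hits."""
    return [r["metric"]["instance"] for r in resp["data"]["result"]]

print(offline_instances(resp))  # ['10.0.0.1:9100'] -> non-empty, a target is down
```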

4. Alertmanager routing verification (simple receiver example)
/etc/alertmanager/alertmanager.yml

route:
  receiver: "default"
  group_by: ["alertname","service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h

receivers:
- name: "default"
  webhook_configs:
  - url: "http://localhost:5001/alerts"
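group_by: ["alertname","service"] means alerts sharing those two label values are batched into a single notification. A sketch of that grouping step with hypothetical alert payloads (Alertmanager then applies group_wait/group_interval timing per group):

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "service")):
    """Batch alerts by the route's group_by labels, one notification per key."""
    groups = defaultdict(list)
    for a in alerts:
        key = tuple(a["labels"].get(lbl, "") for lbl in group_by)
        groups[key].append(a)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "NodeDown", "service": "node", "instance": "10.0.0.1"}},
    {"labels": {"alertname": "NodeDown", "service": "node", "instance": "10.0.0.2"}},
    {"labels": {"alertname": "MySQLConnectionsHigh", "service": "mysql", "instance": "10.0.0.3"}},
]
groups = group_alerts(alerts)
print(len(groups))                        # 2 groups -> 2 notifications
print(len(groups[("NodeDown", "node")]))  # 2 instances batched into one notification
```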

Start a local webhook receiver (to verify delivery)

# Quick-and-dirty receiver with netcat; note nc sends no HTTP response,
# so Alertmanager treats delivery as failed and retries (fine for a one-off check)
nc -lk 5001

Expected result: after an alert fires, the terminal shows the JSON payload POSTed by Alertmanager.
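As an alternative to netcat, a minimal Python receiver can return a proper HTTP 200 so Alertmanager records the delivery as successful. This is a sketch, not a production receiver; the field names follow Alertmanager's webhook payload format, and port 5001 matches the config above:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload):
    """Pull the fields worth logging out of an Alertmanager webhook body."""
    return [(a["status"], a["labels"].get("alertname"), a["labels"].get("instance"))
            for a in payload.get("alerts", [])]

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for status, name, instance in summarize(json.loads(body)):
            print(f"{status}: {name} on {instance}")
        self.send_response(200)   # ack so Alertmanager does not retry
        self.end_headers()

# To actually listen on port 5001 (matching the webhook_configs url above):
# HTTPServer(("", 5001), AlertHandler).serve_forever()

sample = {"alerts": [{"status": "firing",
                      "labels": {"alertname": "NodeDown", "instance": "10.0.0.1"}}]}
print(summarize(sample))  # [('firing', 'NodeDown', '10.0.0.1')]
```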

5. Quick reference: rules for typical scenarios
- Node liveness: up == 0 (for: 2m)
- High CPU: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m]))*100) > 85
- Low memory: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
- Low disk: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})*100 < 15
- High IO wait: avg by (instance)(irate(node_cpu_seconds_total{mode="iowait"}[5m]))*100 > 20
- MySQL slow queries: rate(mysql_global_status_slow_queries[5m]) > 1
- Redis memory: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 85 (requires maxmemory to be set; otherwise redis_memory_max_bytes is 0)
- Kafka under-replicated partitions: kafka_server_replicamanager_underreplicatedpartitions > 0
- SSL certificate expiring: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
- URL availability: probe_success == 0
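The certificate rule compares the earliest expiry timestamp against "now plus 7 days" (86400 * 7 seconds). The same check in Python, with a fixed clock so the example is deterministic:

```python
import time

SEVEN_DAYS = 86400 * 7  # the 86400 * 7 from the rule above

def cert_expiring_soon(earliest_cert_expiry, now=None):
    """Mirror of: probe_ssl_earliest_cert_expiry - time() < 86400 * 7."""
    now = time.time() if now is None else now
    return earliest_cert_expiry - now < SEVEN_DAYS

# Fixed "now" so the example is reproducible
now = 1_700_000_000
print(cert_expiring_soon(now + 86400 * 3, now))   # True: 3 days left
print(cert_expiring_soon(now + 86400 * 30, now))  # False: 30 days left
```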

6. Troubleshooting checklist (commands and notes)

# 1) Rule syntax errors
promtool check rules /etc/prometheus/rules/alerts.yml

# 2) Rules not loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'

# 3) Metrics missing or scrapes failing
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job:.labels.job,health:.health}'

# 4) Expression returns no data
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=your_expr' | jq

# 5) Alertmanager not receiving alerts
curl -s http://localhost:9093/api/v2/status | jq '.cluster'

Common causes:
- Rules directory not referenced by rule_files
- Prometheus not hot-reloaded or restarted
- Exporter not running, or metric names changed
- Alertmanager routing conditions do not match

7. Drill and verification steps (executable)

# 1) Trigger manually: stop an exporter
sudo systemctl stop node_exporter

# 2) Watch the alert fire
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | .labels.alertname'

# 3) Restore the service
sudo systemctl start node_exporter

# 4) Verify the alert resolves
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | .state'

Expected result: the alert transitions from firing back to inactive (it drops out of /api/v1/alerts once resolved), and Alertmanager sends a resolved notification, so the full lifecycle is recorded.

8. Exercises (hands-on)
1) Write a disk inode usage alert with an 85% threshold held for 10m, and validate it with promtool.
2) Add a runbook link to the Nginx 5xx alert and verify that the Alertmanager webhook output includes that field.
3) Simulate Pod restarts, observe whether increase(kube_pod_container_status_restarts_total[10m]) fires, and record the alert timeline.

9. Rollout recommendations
- Each service class should carry at least four alert categories: availability, capacity, performance, and error rate
- Separate "informational" from "actionable" alerts to reduce nighttime noise
- Use multi-metric joint triggers on critical paths to reduce false positives
- Define an owner, an SLA, and handling steps (a runbook) for every alert
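A "multi-metric joint trigger" is usually written in PromQL by combining two conditions with `and`; the decision logic can be sketched in code as well (the error-rate and latency thresholds here are illustrative, not from the source):

```python
def should_page(error_rate, p99_latency_s,
                max_error_rate=0.05, max_latency_s=1.0):
    """Joint trigger: page only when the error rate AND latency are both bad,
    analogous to joining two PromQL conditions with `and` to cut false positives."""
    return error_rate > max_error_rate and p99_latency_s > max_latency_s

print(should_page(0.10, 0.3))  # False: errors high but latency fine -> no page
print(should_page(0.10, 2.5))  # True: both conditions breached -> page
```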