19.5.3 Alert System Design and Noise Reduction Strategies
The core goal of alert system design is to detect failures promptly while keeping noise and unnecessary intervention to a minimum. Build a tiered model around business impact and user experience, aligned with SLOs/SLAs, covering the layers of service availability, key functions, resource capacity, performance degradation, and anomalous behavior. Alert targets should span infrastructure, platform components, middleware, and business metrics, each with a clearly defined trigger condition, owner, and response deadline.
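This model is often maintained as an alert dictionary. A minimal sketch of one entry follows; the field names are assumptions for illustration, not a fixed schema:
# Hypothetical alert-dictionary entry (field names are illustrative only)
- alert_name: DiskAlmostFull_P1
  layer: resource-capacity            # availability / key-function / capacity / degradation / anomaly
  object: node filesystem             # infrastructure / platform / middleware / business metric
  trigger: "filesystem available ratio < 10% for 5m"
  owner: ops-team
  response_deadline: 30m
  related_slo: "user-api availability 99.9%"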
Figure: alert pipeline overview (collection, rules, notification, closed-loop handling).
Recommended alert severity levels are P0/P1/P2/P3:
- P0: global outage or core-path interruption
- P1: key function unavailable or severely degraded
- P2: performance degradation and capacity risk
- P3: non-core anomalies and trend reminders
Each level needs its own notification channels, escalation path, on-call policy, and response SOP, so that not everything ends up at the same priority; a sample policy map is sketched below.
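A minimal sketch of such a per-level policy map; the channels, deadlines, and on-call windows below are illustrative assumptions, not recommendations:
# Hypothetical severity policy map; adjust to your own on-call setup
P0: { notify: phone+im, ack_within: 5m,  escalate_after: 15m,  oncall: 24x7 }
P1: { notify: im,       ack_within: 15m, escalate_after: 30m,  oncall: 24x7 }
P2: { notify: im,       ack_within: 2h,  escalate_after: 1d,   oncall: business-hours }
P3: { notify: ticket,   ack_within: 1d,  escalate_after: none, oncall: business-hours }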
Example: Prometheus + Alertmanager severity grading and noise-reduction configuration
Installation and basic verification (binary install, Linux x86_64):
# 1) Download and extract
cd /opt
wget https://github.com/prometheus/prometheus/releases/download/v2.49.0/prometheus-2.49.0.linux-amd64.tar.gz
tar -xzf prometheus-2.49.0.linux-amd64.tar.gz
ln -s prometheus-2.49.0.linux-amd64 prometheus
# 2) Start Prometheus (example port 9090)
/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data &
# 3) Check the listening port
ss -lntp | grep 9090
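Beyond checking the port, Prometheus also exposes built-in health and readiness endpoints that make for a quick sanity check (assuming the default port 9090 on the local host):
# 4) Optional: hit the built-in health/readiness endpoints (expect HTTP 200 with a short status message)
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/-/ready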
Prometheus configuration example (including the alert rule file):
# /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Needed so that fired alerts are actually delivered to Alertmanager (address matches the examples below)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.local:9093"]
rule_files:
  - /opt/prometheus/rules/alerts.yml
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["10.0.0.10:9100","10.0.0.11:9100"]
Alert rule examples (severity levels, hold duration, sliding window):
# /opt/prometheus/rules/alerts.yml
groups:
  - name: linux_node_alerts
    rules:
      - alert: NodeDown_P0
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: P0
          team: ops
        annotations:
          summary: "Node down"
          description: "Node {{ $labels.instance }} has been unresponsive for 2 minutes"
      - alert: CpuHigh_P2
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: P2
          team: ops
        annotations:
          summary: "CPU usage high"
          description: "Node {{ $labels.instance }} CPU has stayed above 85% for 10 minutes"
      - alert: DiskAlmostFull_P1
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} /
               node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.1
        for: 5m
        labels:
          severity: P1
          team: ops
        annotations:
          summary: "Disk almost full"
          description: "Node {{ $labels.instance }} has less than 10% disk space available"
Alertmanager routing and noise reduction (grouping, deduplication, inhibition, silences):
# /opt/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ["alertname","instance","severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  receiver: "ops-im"
  routes:
    - matchers:
        - severity="P0"
      receiver: "ops-phone"
      repeat_interval: 15m
    - matchers:
        - severity="P1"
      receiver: "ops-im"
      repeat_interval: 30m
receivers:
  - name: "ops-im"
    webhook_configs:
      - url: "http://im-gateway.local/alert"
  - name: "ops-phone"
    webhook_configs:
      - url: "http://sms-gateway.local/alert"
inhibit_rules:
  # Root-cause inhibition: when a P0 fires on an instance, suppress P1/P2 alerts for the same instance
  - source_matchers:
      - severity="P0"
    target_matchers:
      - severity=~"P1|P2"   # the regex matcher (=~) is required to cover both levels
    equal: ["instance"]
# Maintenance-window silences are configured via the Alertmanager UI or API
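Assuming Alertmanager is installed under /opt/alertmanager in the same binary fashion as Prometheus above, the configuration can be checked with amtool (bundled with Alertmanager) and the service started as follows:
# Validate the Alertmanager configuration
/opt/alertmanager/amtool check-config /opt/alertmanager/alertmanager.yml
# Start Alertmanager (default listen port 9093)
/opt/alertmanager/alertmanager \
  --config.file=/opt/alertmanager/alertmanager.yml \
  --storage.path=/opt/alertmanager/data &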
Alert noise-reduction strategies (with executable examples)
1) Sliding window and sustained triggering
# for: 10m means the condition must hold for 10 minutes before the alert fires, filtering out transient spikes
for: 10m
2) Grouping and deduplication
# group_by plus group_wait/group_interval prevent alert storms
group_by: ["alertname","instance","severity"]
group_wait: 30s
group_interval: 5m
3) Maintenance-window silences (API example)
# Create a silence: suppress alerts for a service during the maintenance window
curl -X POST http://alertmanager.local:9093/api/v2/silences \
-H 'Content-Type: application/json' \
-d '{
"matchers":[{"name":"job","value":"node","isRegex":false}],
"startsAt":"2024-06-01T02:00:00Z",
"endsAt":"2024-06-01T04:00:00Z",
"createdBy":"ops",
"comment":"maintenance window"
}'
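The same silence can also be managed with amtool; a sketch assuming --alertmanager.url points at the instance above:
# List existing silences
amtool --alertmanager.url=http://alertmanager.local:9093 silence query
# Create an equivalent 2-hour silence from the command line
amtool --alertmanager.url=http://alertmanager.local:9093 silence add job=node \
  --duration=2h --author=ops --comment="maintenance window"
# Expire (remove) a silence by its ID if maintenance finishes early
amtool --alertmanager.url=http://alertmanager.local:9093 silence expire <silence-id>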
Alert troubleshooting guide (common issues and commands)
1) Alert does not fire
# Are the rules loaded?
curl -s http://prometheus.local:9090/api/v1/rules | jq '.data.groups[].name'
# Are the targets being scraped and healthy?
curl -s http://prometheus.local:9090/api/v1/targets | jq '.data.activeTargets[]|.health'
2) Alert fires but no notification is sent
# Is the Alertmanager configuration in effect?
curl -s http://alertmanager.local:9093/api/v2/status | jq '.config'
# Is the alert inhibited or silenced?
curl -s http://alertmanager.local:9093/api/v2/silences | jq '.[].status'
3) Alert storm
# Observe the number of firing alerts per severity (PromQL example)
sum by (severity) (ALERTS{alertstate="firing"})
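For any of the cases above, the raw alert state (pending vs. firing) can also be read straight from the Prometheus query API before digging into the notification path; hostnames follow the earlier examples:
# Inspect the state of a specific alert defined in the rules above
curl -s 'http://prometheus.local:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="NodeDown_P0"}' | jq '.data.result'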
Alert notification content template (with actionable steps)
Severity: P1
Service: user-api
Instance: 10.0.0.10:9100
Alert: DiskAlmostFull
Impact: disk space available below 10% for 5 minutes
Suggested actions:
1) df -h /data
2) du -sh /data/* | sort -hr | head
3) Clean up logs or expand capacity
Related changes: no releases in the last 24 hours
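For reference, the ops-im receiver in the routing example above gets its data as an Alertmanager webhook payload, from which a template like the one above can be rendered. A trimmed sketch of that payload (field values are illustrative):
{
  "version": "4",
  "status": "firing",
  "receiver": "ops-im",
  "groupLabels": { "alertname": "DiskAlmostFull_P1", "instance": "10.0.0.10:9100", "severity": "P1" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "DiskAlmostFull_P1", "instance": "10.0.0.10:9100", "severity": "P1", "team": "ops" },
      "annotations": { "summary": "Disk almost full", "description": "Node 10.0.0.10:9100 has less than 10% disk space available" },
      "startsAt": "2024-06-01T02:10:00Z"
    }
  ]
}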
Exercises (with expected results)
1) Exercise: verify sustained triggering
- Action: stop node_exporter on a node for at least 2 minutes
- Command: systemctl stop node_exporter
- Expected: NodeDown_P0 fires after the 2-minute hold period, a notification is sent, and the alert resolves automatically after recovery
2) Exercise: verify the inhibition rule
- Action: manually take a node offline while also driving its CPU high
- Expected: once the P0 fires, the P2 CPU alert is inhibited and no duplicate notification is sent
3) Exercise: create a maintenance-window silence
- Action: create a 1-hour silence via the API
- Expected: no alerts for that service are sent during the window; notifications resume automatically after it ends
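To check the expected results without waiting for a notification to arrive, the current alert and silence state can be queried directly with amtool (URL as in the earlier examples):
# Exercises 1 and 2: list the alerts Alertmanager currently knows about, with their severity labels
amtool --alertmanager.url=http://alertmanager.local:9093 alert query
# Exercise 3: confirm the silence exists and is active during the window
amtool --alertmanager.url=http://alertmanager.local:9093 silence query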
With the grading, grouping, inhibition, and silencing strategies above, combined with a standardized alert dictionary and a closed-loop SOP process, noise can be reduced significantly while alerts become more actionable and better aligned with business value.