19.11.3 Case Study: Building a Monitoring and Alerting Platform with SLO Governance
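This case study targets a unified monitoring and alerting platform that covers metrics for hosts, containers, databases, middleware, and business request paths, and pairs it with SLO governance to form a closed-loop operations system. The platform uses Prometheus as its core collection engine, supported by alerting, event, and self-healing components, giving a complete pipeline: collection, storage, visualization, alerting, response, and post-incident review.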
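1. Goals and Scope
- Unified metric system: standardize metrics for core components such as hosts, networks, applications, databases, containers, K8s, and message queues.
- Single alert entry point: multi-channel notification, event correlation, alert noise reduction, and severity tiers.
- Closed-loop SLO governance: define service level objectives, error budgets, and how error budget consumption is tracked.
- Observability extension: link logs, traces, and metrics for fast diagnosis and review.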
2. Platform Architecture and Components (with Schematic Sketch)
3. Installation and Basic Onboarding Example (Prometheus + Node Exporter + Alertmanager + Grafana)
- Installation directory layout: /opt/monitoring
- Ports: Prometheus 9090, Node Exporter 9100, Alertmanager 9093, Grafana 3000
# 1) Create the directory layout (including the rules directory referenced by rule_files)
mkdir -p /opt/monitoring/{prometheus/rules,alertmanager,grafana,exporters}
cd /opt/monitoring
# 2) Install Prometheus
# Unpack straight into the existing directory; a plain mv would nest the release folder inside it
wget https://github.com/prometheus/prometheus/releases/download/v2.49.1/prometheus-2.49.1.linux-amd64.tar.gz
tar -xf prometheus-2.49.1.linux-amd64.tar.gz -C prometheus --strip-components=1
# 3) Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xf node_exporter-1.7.0.linux-amd64.tar.gz
mv node_exporter-1.7.0.linux-amd64 exporters/node
# 4) Install Alertmanager (same approach as Prometheus)
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar -xf alertmanager-0.27.0.linux-amd64.tar.gz -C alertmanager --strip-components=1
Prometheus configuration example (/opt/monitoring/prometheus/prometheus.yml)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["127.0.0.1:9100"]
rule_files:
  - "/opt/monitoring/prometheus/rules/*.yml"
Alertmanager configuration example (/opt/monitoring/alertmanager/alertmanager.yml)
route:
  receiver: "default"
  group_by: ["alertname","instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
receivers:
  - name: "default"
    email_configs:
      - to: "ops@example.com"
        from: "alert@example.com"
        smarthost: "smtp.example.com:25"
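To make the effect of group_by concrete, here is a minimal Python sketch of how alerts might be bucketed by the listed labels. It is a simplification for illustration only: the function name group_alerts and the sample alerts are hypothetical, and the timing behaviour controlled by group_wait, group_interval, and repeat_interval is not modelled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    Illustrative only: a label missing from an alert is treated as an empty value.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical sample alerts, written by hand for this sketch.
alerts = [
    {"alertname": "NodeCPUHigh", "instance": "10.0.0.1:9100", "severity": "P2"},
    {"alertname": "NodeCPUHigh", "instance": "10.0.0.1:9100", "severity": "P3"},
    {"alertname": "NodeCPUHigh", "instance": "10.0.0.2:9100", "severity": "P2"},
]

groups = group_alerts(alerts, ["alertname", "instance"])
for key, members in groups.items():
    print(key, len(members))
```

With group_by set to alertname and instance, the two alerts from the same instance share one notification group, while the third alert forms its own group.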
Startup commands
# Start node_exporter
/opt/monitoring/exporters/node/node_exporter &
# Start Prometheus (--web.enable-lifecycle enables the /-/reload endpoint used in section 4)
/opt/monitoring/prometheus/prometheus \
  --config.file=/opt/monitoring/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/monitoring/prometheus/data \
  --web.enable-lifecycle &
# Start Alertmanager
/opt/monitoring/alertmanager/alertmanager \
  --config.file=/opt/monitoring/alertmanager/alertmanager.yml &
# Install Grafana (this example installs the official RPM with yum)
yum install -y https://dl.grafana.com/oss/release/grafana-10.2.2-1.x86_64.rpm
systemctl enable --now grafana-server
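Verification
# Check Prometheus targets (the API returns single-line JSON, so jq is easier to read than grep)
curl -s http://127.0.0.1:9090/api/v1/targets | jq -r '.data.activeTargets[] | "\(.labels.job) \(.health)"'
# Check Node Exporter metrics
curl -s http://127.0.0.1:9100/metrics | head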
4. Key Metrics and Alert Rule Example
# /opt/monitoring/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeCPUHigh
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: P2
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU > 85% for 10 minutes"
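The arithmetic behind the NodeCPUHigh expression can be sketched in Python: take the per-core rate of the idle counter, average it across cores, and subtract the result from 100. This is an illustrative approximation under simple assumptions; the function name and the sample counter values are hypothetical, and Prometheus's rate() additionally handles counter resets and extrapolation, which this sketch ignores.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate CPU usage (%) for one instance.

    idle_start / idle_end: per-core values of the idle-seconds counter at the
    start and end of the window. Counter resets are not handled.
    """
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)  # fraction of time spent idle
    return 100 - avg_idle * 100


# Hypothetical two-core host sampled over a 5-minute (300 s) window.
usage = cpu_usage_percent([1000.0, 2000.0], [1030.0, 2045.0], 300)
print(f"CPU usage: {usage:.1f}%  exceeds threshold: {usage > 85}")
```

In this sample the cores were idle for 30 s and 45 s of the 300 s window, so the computed usage is above the 85% threshold. The rule would still wait for the condition to hold for 10 minutes (the for clause) before firing.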
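Verifying that rules are loaded
# Trigger a Prometheus rule reload (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://127.0.0.1:9090/-/reload
# List alerts that are currently pending or firing
curl -s http://127.0.0.1:9090/api/v1/alerts | jq '.data.alerts[]?.labels.alertname'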
5. SLO Definition and Calculation Example
- SLI: HTTP success rate
- SLO: 99.9% success rate over 30 days
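# Success-rate SLI
sum(rate(http_requests_total{job="app",code!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="app"}[5m]))
# Fraction of the 30-day error budget consumed (simplified example):
# the observed error ratio divided by the allowed error ratio (1 - 0.999).
# A [30d] range needs at least 30 days of retention; Prometheus keeps 15d by default.
(
  1 - (
    sum(rate(http_requests_total{job="app",code!~"5.."}[30d]))
    /
    sum(rate(http_requests_total{job="app"}[30d]))
  )
) / (1 - 0.999)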
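The budget arithmetic behind the 99.9% / 30-day SLO can be worked through in a few lines of Python. This is a sketch under stated assumptions: the function name error_budget and the request counts are hypothetical, and budget consumption is taken to mean the observed error ratio divided by the allowed error ratio (1 minus the SLO target).

```python
def error_budget(slo_target, total_requests, failed_requests, window_days=30):
    """Summarise error budget usage for a request-based SLO (illustrative)."""
    allowed_ratio = 1 - slo_target                  # e.g. 0.001 for 99.9%
    error_ratio = failed_requests / total_requests  # observed 5xx share
    return {
        "allowed_failures": allowed_ratio * total_requests,
        "error_ratio": error_ratio,
        "budget_consumed": error_ratio / allowed_ratio,
        # Equivalent full-outage time the budget would cover in the window.
        "allowed_downtime_minutes": allowed_ratio * window_days * 24 * 60,
    }


# Hypothetical traffic: 10 million requests in the window, 4,000 of them 5xx.
report = error_budget(0.999, total_requests=10_000_000, failed_requests=4_000)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

For these sample numbers the service has used roughly 40% of its budget, and a 99.9% target over 30 days corresponds to about 43.2 minutes of full outage.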
Suggested SLO dashboard fields
- Service name, metric name, SLO target, current SLI, budget consumed (%), remaining budget (days)
6. Typical Middleware Onboarding Example (MySQL + Nginx + Redis)
# Install mysqld_exporter (example)
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.1/mysqld_exporter-0.15.1.linux-amd64.tar.gz
tar -xf mysqld_exporter-0.15.1.linux-amd64.tar.gz
mv mysqld_exporter-0.15.1.linux-amd64 /opt/monitoring/exporters/mysql
# Create the MySQL account (replace the sample password 'exporter' in real deployments)
mysql -e "CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter'; GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';"
cat >/opt/monitoring/exporters/mysql/.my.cnf <<'EOF'
[client]
user=exporter
password=exporter
EOF
# Restrict access to the credentials file
chmod 600 /opt/monitoring/exporters/mysql/.my.cnf
# Start mysqld_exporter
/opt/monitoring/exporters/mysql/mysqld_exporter \
  --config.my-cnf=/opt/monitoring/exporters/mysql/.my.cnf &
Add the scrape job to Prometheus (append under scrape_configs in prometheus.yml)
  - job_name: "mysql"
    static_configs:
      - targets: ["127.0.0.1:9104"]
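7. Alert Governance and Noise Reduction (with Configuration Example)
# Alertmanager routing example: inhibition and severity tiers
# The "oncall" receiver must also be defined under receivers, or the config will not load.
route:
  receiver: "default"
  routes:
    - matchers: ["severity=P1"]
      receiver: "oncall"
      group_wait: 10s
      repeat_interval: 30m
# A firing P1 alert mutes P3 alerts that share the same instance and alertname.
inhibit_rules:
  - source_matchers: ["severity=P1"]
    target_matchers: ["severity=P3"]
    equal: ["instance","alertname"]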
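The inhibition rule above can be read as a small matching procedure. The Python sketch below is a simplified model for illustration, not Alertmanager's implementation: the function name is_inhibited and the sample alerts are hypothetical, and only exact-equality matchers are supported.

```python
def is_inhibited(target, alerts, source_match, target_match, equal):
    """Return True if a firing source alert inhibits the target alert.

    Simplified model: matchers are exact label equalities, and the labels
    listed in `equal` must have identical values on both alerts.
    """
    def matches(alert, matcher):
        return all(alert.get(key) == value for key, value in matcher.items())

    if not matches(target, target_match):
        return False
    for source in alerts:
        if source is target or not matches(source, source_match):
            continue
        if all(source.get(label) == target.get(label) for label in equal):
            return True
    return False


# Hypothetical firing alerts.
p1 = {"alertname": "NodeDown", "instance": "10.0.0.1:9100", "severity": "P1"}
p3_same = {"alertname": "NodeDown", "instance": "10.0.0.1:9100", "severity": "P3"}
p3_other = {"alertname": "DiskSlow", "instance": "10.0.0.1:9100", "severity": "P3"}
firing = [p1, p3_same, p3_other]

rule = dict(
    source_match={"severity": "P1"},
    target_match={"severity": "P3"},
    equal=["instance", "alertname"],
)
print(is_inhibited(p3_same, firing, **rule))   # same instance and alertname
print(is_inhibited(p3_other, firing, **rule))  # different alertname
```

Because alertname is listed in equal, a P1 alert only mutes P3 alerts with the same alert name on the same instance. If the intent is for any P1 on an instance to mute all of that instance's P3 alerts, listing only instance in equal would be the closer fit.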
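8. Troubleshooting Common Problems (Commands Mapped to Symptoms)
- Prometheus cannot scrape metrics
# Check that the target is reachable
curl -s http://127.0.0.1:9100/metrics | head
# Inspect Prometheus target status
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[]|{job:.labels.job,health:.health,lastError:.lastError}'
- Alert does not fire
# Check that the rule is loaded
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].rules[]|select(.name=="NodeCPUHigh")'
# Evaluate the alert expression (URL-encode the query so curl does not treat {} and [] as glob patterns)
curl -s -G http://127.0.0.1:9090/api/v1/query \
  --data-urlencode 'query=100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
- Grafana shows no data
# Confirm the data source exists (the API requires authentication; admin:admin is the initial default, substitute your own credentials)
curl -s -u admin:admin http://127.0.0.1:3000/api/datasources | jq '.[].name'
# In the UI, check that the Prometheus URL is http://127.0.0.1:9090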
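When there are many targets, the target-status check can be scripted. The Python sketch below filters a /api/v1/targets response body for targets that are not healthy. It assumes the response shape used by the jq filter above (data.activeTargets with labels, health, and lastError); the function name unhealthy_targets and the sample payload are hypothetical and hand-written for illustration.

```python
import json


def unhealthy_targets(payload):
    """Return (job, instance, lastError) for each active target not reporting 'up'."""
    data = json.loads(payload)
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
        for t in data["data"]["activeTargets"]
        if t.get("health") != "up"
    ]


# Hand-written sample mimicking the response body; not captured from a real server.
sample = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "127.0.0.1:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "mysql", "instance": "127.0.0.1:9104"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
})

for job, instance, error in unhealthy_targets(sample):
    print(f"{job} {instance}: {error}")
```

In practice the payload would come from the curl command shown above, for example by piping its output into a script that reads standard input.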
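9. Key Rollout Steps (with Example Actions)
1) Unify the service catalog and metric conventions (define standards for the job, instance, and env labels)
2) Monitor core services and critical paths first (onboard the transaction path first)
3) Roll out tiered alerting and noise-reduction rules (P1/P2/P3 plus inhibition)
4) Trial-run SLOs and error budgets (a 30-day trial)
5) Close the loop with reviews and continuous improvement (weekly SLO review)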
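Step 1 is easier to enforce when the label standard is checked automatically. The Python sketch below shows one possible check; the required label set comes from the step above, while the function name label_violations and the allowed env values are assumptions made purely for illustration.

```python
REQUIRED_LABELS = ("job", "instance", "env")
ALLOWED_ENVS = {"prod", "staging", "dev"}  # hypothetical convention, adjust to your own


def label_violations(labels):
    """Return human-readable problems for one target's label set."""
    problems = [
        f"missing label: {name}"
        for name in REQUIRED_LABELS
        if not labels.get(name)
    ]
    env = labels.get("env")
    if env and env not in ALLOWED_ENVS:
        problems.append(f"unexpected env value: {env}")
    return problems


ok = label_violations({"job": "node", "instance": "127.0.0.1:9100", "env": "prod"})
bad = label_violations({"job": "mysql", "instance": "127.0.0.1:9104"})
print(ok)
print(bad)
```

A check like this could run against the label sets returned by the Prometheus targets API before a new service is accepted into the catalog.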
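10. Exercises and Hands-On Practice
- Exercise 1: Install node_exporter on your local machine, get Prometheus scraping it, and create a CPU usage panel in Grafana.
- Exercise 2: Write a custom alert (disk usage > 80% for 10 minutes) and deliver it to email through Alertmanager.
- Exercise 3: Simulate 5xx errors and use PromQL to calculate error budget consumption over 1 hour.
- Exercise 4: Enable inhibit_rules and verify that P3 alerts are inhibited while a P1 alert is firing.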
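This case study traces the path of a monitoring and alerting platform from tool setup to SLO governance, showing how stability engineering and platform-based operations can be put into practice together.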