19.11.3 Case Study: Building a Monitoring and Alerting Platform with SLO Governance

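This case study builds a unified monitoring and alerting platform that covers host, container, database, middleware, and business-path metrics, and pairs it with SLO governance to close the operations loop. Prometheus is the core collection engine; alerting, event, and self-healing components sit alongside it, forming a complete chain: collection, storage, visualization, alerting, response, and postmortem review.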

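1. Goals and Scope
- Unified metric model: standardize metrics for core components such as hosts, networks, applications, databases, containers, K8s, and message queues.
- Single alert entry point: multi-channel notification, event correlation, alert noise reduction, and severity tiers.
- SLO governance loop: define service-level objectives, error budgets, and a policy for handling error-budget consumption.
- Observability extension: link logs, traces, and metrics for fast diagnosis and postmortems.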

2. Platform Architecture and Components (with schematic)

[Figure: platform architecture schematic]

3. Installation and Basic Onboarding Example (Prometheus + Node Exporter + Alertmanager + Grafana)
- Install root: /opt/monitoring
- Ports: Prometheus 9090, Alertmanager 9093, Grafana 3000, Node Exporter 9100

# 1) Create directories (including the rules directory referenced by prometheus.yml)
mkdir -p /opt/monitoring/{prometheus/rules,alertmanager,grafana,exporters/node}
cd /opt/monitoring

# 2) Install Prometheus
# Extract straight into the existing directory; a plain `mv` onto an existing
# directory would nest the release folder inside it and break the paths used later.
wget https://github.com/prometheus/prometheus/releases/download/v2.49.1/prometheus-2.49.1.linux-amd64.tar.gz
tar -xf prometheus-2.49.1.linux-amd64.tar.gz -C prometheus --strip-components=1

# 3) Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xf node_exporter-1.7.0.linux-amd64.tar.gz -C exporters/node --strip-components=1

# 4) Install Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar -xf alertmanager-0.27.0.linux-amd64.tar.gz -C alertmanager --strip-components=1

Prometheus configuration example (/opt/monitoring/prometheus/prometheus.yml)

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["127.0.0.1:9093"]

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["127.0.0.1:9100"]

rule_files:
  - "/opt/monitoring/prometheus/rules/*.yml"

Alertmanager configuration example (/opt/monitoring/alertmanager/alertmanager.yml)

route:
  receiver: "default"
  group_by: ["alertname","instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h

receivers:
- name: "default"
  email_configs:
  - to: "ops@example.com"
    from: "alert@example.com"
    smarthost: "smtp.example.com:25"

Startup commands

# Start node_exporter
/opt/monitoring/exporters/node/node_exporter &

# Start Prometheus (--web.enable-lifecycle enables the /-/reload endpoint used later)
/opt/monitoring/prometheus/prometheus \
  --config.file=/opt/monitoring/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/monitoring/prometheus/data \
  --web.enable-lifecycle &

# Start Alertmanager
/opt/monitoring/alertmanager/alertmanager \
  --config.file=/opt/monitoring/alertmanager/alertmanager.yml &

# Install Grafana (this example installs the official RPM with yum)
yum install -y https://dl.grafana.com/oss/release/grafana-10.2.2-1.x86_64.rpm
systemctl enable --now grafana-server

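Verification

# Check Prometheus target health (the API returns single-line JSON, so jq is used to pick out the fields)
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check Node Exporter metrics
curl -s http://127.0.0.1:9100/metrics | head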

4. Key Metrics and Alert Rule Example

# /opt/monitoring/prometheus/rules/node_alerts.yml
groups:
- name: node_alerts
  rules:
  - alert: NodeCPUHigh
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 10m
    labels:
      severity: P2
    annotations:
      summary: "High CPU usage"
      description: "Instance {{ $labels.instance }} CPU usage has been above 85% for 10 minutes"

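Verifying that rules are loaded

# Trigger a Prometheus reload (the endpoint only works when Prometheus was started with --web.enable-lifecycle)
curl -X POST http://127.0.0.1:9090/-/reload

# List the alerts that are currently pending or firing
curl -s http://127.0.0.1:9090/api/v1/alerts | jq '.data.alerts[]?.labels.alertname'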
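
Before triggering a reload, it can help to validate the files offline, because Prometheus rejects a reload when a rule file fails to parse. The sketch below wraps `promtool`, which ships in the Prometheus release tarball. The function names and the `PROM_HOME` variable are illustrative assumptions, and the paths assume the directory layout used in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names and PROM_HOME are assumptions, not part of Prometheus).
PROM_HOME="${PROM_HOME:-/opt/monitoring/prometheus}"

# Run promtool from the Prometheus directory; return 127 if it is missing.
_promtool() {
  local bin="${PROM_HOME}/promtool"
  if [ ! -x "$bin" ]; then
    echo "promtool not found at ${bin}" >&2
    return 127
  fi
  "$bin" "$@"
}

# Validate one or more rule files.
check_rules() { _promtool check rules "$@"; }

# Validate the main config (this also checks the rule files it references).
check_config() { _promtool check config "$@"; }

# Usage:
#   check_rules /opt/monitoring/prometheus/rules/node_alerts.yml \
#     && curl -X POST http://127.0.0.1:9090/-/reload
```

Chaining the reload behind `&&` means a broken rule file never reaches the running server.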

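5. SLO Definition and Calculation Example
- SLI: HTTP success rate
- SLO: 99.9% success rate over 30 days

# Success-rate SLI (5-minute window)
sum(rate(http_requests_total{job="app",code!~"5.."}[5m]))
/
sum(rate(http_requests_total{job="app"}[5m]))

# Fraction of the 30-day error budget consumed (simplified example):
# the observed error ratio divided by the allowed error ratio (1 - 0.999).
# A [30d] range needs at least 30 days of retention (--storage.tsdb.retention.time; the default is 15d).
(
  1 - (
    sum(rate(http_requests_total{job="app",code!~"5.."}[30d]))
    /
    sum(rate(http_requests_total{job="app"}[30d]))
  )
) / (1 - 0.999)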

Suggested SLO dashboard fields
- Service name, metric name, SLO target, current SLI, error budget consumed (%), remaining budget (days)
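
To make the budget fields concrete, the arithmetic behind them can be sketched in a few lines of shell. The function names are hypothetical, the sample SLI of 99.95% is an assumed reading rather than a measured value, and "remaining budget in days" is interpreted here as the unused share of the budget scaled to the window length, which is only one possible definition.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the dashboard arithmetic; all names are illustrative.

# Total error budget expressed as equivalent minutes of total outage.
# usage: budget_minutes <slo_percent> <window_days>
budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

# Percentage of the error budget already consumed.
# usage: budget_consumed_pct <sli_percent> <slo_percent>
budget_consumed_pct() {
  awk -v sli="$1" -v slo="$2" \
    'BEGIN { printf "%.1f\n", (100 - sli) / (100 - slo) * 100 }'
}

# Unused share of the budget, scaled to the window length in days.
# usage: budget_remaining_days <sli_percent> <slo_percent> <window_days>
budget_remaining_days() {
  awk -v sli="$1" -v slo="$2" -v days="$3" \
    'BEGIN { printf "%.1f\n", (1 - (100 - sli) / (100 - slo)) * days }'
}

budget_minutes 99.9 30               # prints 43.2
budget_consumed_pct 99.95 99.9       # prints 50.0
budget_remaining_days 99.95 99.9 30  # prints 15.0
```

A 99.9% target over 30 days therefore leaves roughly 43 minutes of total outage, which is a useful sanity check when choosing the target.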

6. Typical Middleware Onboarding Example (MySQL + Nginx + Redis; MySQL shown)

# Install mysqld_exporter (example)
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.1/mysqld_exporter-0.15.1.linux-amd64.tar.gz
tar -xf mysqld_exporter-0.15.1.linux-amd64.tar.gz
mv mysqld_exporter-0.15.1.linux-amd64 /opt/monitoring/exporters/mysql

# Create the MySQL account (replace the sample password 'exporter' in real deployments)
mysql -e "CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter'; GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';"
cat >/opt/monitoring/exporters/mysql/.my.cnf <<'EOF'
[client]
user=exporter
password=exporter
EOF
# Restrict the credentials file to its owner
chmod 600 /opt/monitoring/exporters/mysql/.my.cnf

# Start mysqld_exporter (listens on port 9104 by default)
/opt/monitoring/exporters/mysql/mysqld_exporter \
  --config.my-cnf=/opt/monitoring/exporters/mysql/.my.cnf &

Add the scrape job to Prometheus (append under scrape_configs in prometheus.yml)

- job_name: "mysql"
  static_configs:
    - targets: ["127.0.0.1:9104"]
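
Once the exporter is running, a quick way to confirm that it can actually reach MySQL is to read the `mysql_up` gauge from its metrics endpoint, where 1 means the exporter could connect. The helper below is a hypothetical sketch that handles only plain, label-free samples.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of a label-free metric read from
# Prometheus text exposition on stdin; exit 1 if the metric is absent.
# usage: curl -s http://127.0.0.1:9104/metrics | metric_value mysql_up
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }
    END { if (!found) exit 1 }
  '
}
```

Comment lines such as `# HELP mysql_up ...` are skipped naturally, because their first field is `#` rather than the metric name.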

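7. Alert Governance and Noise Reduction (with configuration example)

# Alertmanager routing example: inhibition and severity tiers
# Both the "default" and "oncall" receivers must be defined under receivers.
route:
  receiver: "default"
  routes:
  - matchers: ["severity=P1"]
    receiver: "oncall"
    group_wait: 10s
    repeat_interval: 30m

# A firing P1 alert mutes P3 alerts that carry the same instance and alertname labels.
inhibit_rules:
- source_matchers: ["severity=P1"]
  target_matchers: ["severity=P3"]
  equal: ["instance","alertname"]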
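8. Troubleshooting Common Problems (symptoms and matching commands)
- Prometheus cannot scrape metrics

# Check that the target is reachable
curl -s http://127.0.0.1:9100/metrics | head

# Inspect target status in Prometheus
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[]|{job:.labels.job,health:.health,lastError:.lastError}'

- Alert does not fire

# Check that the rule is loaded
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].rules[]|select(.name=="NodeCPUHigh")'

# Evaluate the alert expression (-G with --data-urlencode keeps the spaces, braces, and brackets in the query intact)
curl -s -G http://127.0.0.1:9090/api/v1/query \
  --data-urlencode 'query=100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

- Grafana shows no data

# Confirm the data source is registered (the API requires authentication; substitute your admin password)
curl -s -u 'admin:<password>' http://127.0.0.1:3000/api/datasources | jq '.[].name'
# In the UI, check that the Prometheus URL is http://127.0.0.1:9090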
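
The checks above can be preceded by a quick liveness sweep. Prometheus and Alertmanager expose `/-/healthy`, and Grafana exposes `/api/health`. The helper name below is hypothetical, and the URLs assume the default ports used in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report UP/DOWN for an HTTP health endpoint.
# usage: check_endpoint <name> <url>
check_endpoint() {
  local name="$1" url="$2"
  # -f makes curl fail on HTTP error codes; --max-time bounds a hung connection.
  if curl -sf --max-time 3 -o /dev/null "$url"; then
    echo "${name} UP"
  else
    echo "${name} DOWN"
    return 1
  fi
}

# Usage:
#   check_endpoint prometheus   http://127.0.0.1:9090/-/healthy
#   check_endpoint alertmanager http://127.0.0.1:9093/-/healthy
#   check_endpoint grafana      http://127.0.0.1:3000/api/health
```

If a component reports DOWN here, the more detailed target, rule, and data source checks are unlikely to be informative until the process itself is back.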

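9. Key Rollout Steps (with example actions)
1) Unify the service catalog and metric conventions (define a standard for the job, instance, and env labels)
2) Monitor core services and critical paths first (onboard the transaction path first)
3) Roll out alert severity tiers and noise-reduction rules (P1/P2/P3 plus inhibition)
4) Pilot SLOs and error budgets (30-day trial run)
5) Establish postmortems and a continuous-improvement loop (weekly SLO review)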

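10. Exercises and Hands-On Practice
- Exercise 1: Install node_exporter on the local machine, have Prometheus scrape it, and create a CPU usage panel in Grafana.
- Exercise 2: Write a custom alert (disk usage > 80% for 10 minutes) and deliver it to a mailbox through Alertmanager.
- Exercise 3: Simulate 5xx errors and use PromQL to calculate error-budget consumption over 1 hour.
- Exercise 4: Enable inhibit_rules and verify that P3 alerts are suppressed while a P1 alert is firing.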

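This case study traces the path from standing up the tooling to SLO governance, showing how reliability engineering and a platform approach to operations can be put into practice together.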