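19.7.8 Release Metrics and Continuous Improvement
Release metrics and continuous improvement should form a closed loop: event collection → metric calculation → visualization → improvement. The core metrics should cover release frequency, lead time, change failure rate, MTTR, rollback rate, approval time, release-window hit rate, gray-release (canary) coverage, SLA impact, and cost efficiency. Every metric must be bound to an explicit data source and definition in a metric dictionary, so that one name never carries different meanings for different teams.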
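Metric dictionary example (one agreed definition per metric)
Metric name: change failure rate
Definition: failed releases / total releases
Data source: release platform event table
Period: weekly
Threshold: above 10% triggers remediation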
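A dictionary entry like the one above can also be kept in machine-readable form, so that the threshold is checked by code rather than by hand. The following is a minimal sketch in Python; the `MetricDefinition` class and its field names are hypothetical and are not part of any release platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """One metric-dictionary entry (hypothetical structure, for illustration)."""
    name: str
    formula: str
    source: str
    period: str
    threshold: float  # remediation triggers when the value exceeds this

    def needs_remediation(self, value: float) -> bool:
        # Strictly greater than, matching the "above 10%" rule in the entry.
        return value > self.threshold


# The change-failure-rate entry from the dictionary example.
change_failure_rate = MetricDefinition(
    name="change_failure_rate",
    formula="failed releases / total releases",
    source="release platform event table",
    period="week",
    threshold=0.10,
)
```

Calling `change_failure_rate.needs_remediation(0.25)` returns `True`, while a value of exactly `0.10` does not trigger remediation because the rule is a strict comparison.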
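Worked example: build a release event store and compute metrics
1) Install and initialize MySQL (Ubuntu shown)
# Install
sudo apt update
sudo apt install -y mysql-server
# Enable at boot and start now
sudo systemctl enable --now mysql
# Create the schema (example assumes local root login without a password; on a default Ubuntu install root uses auth_socket, so run the client as "sudo mysql"; harden before production use)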
mysql -u root <<'SQL'
CREATE DATABASE ops_metrics;
USE ops_metrics;
CREATE TABLE release_events (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
app_name VARCHAR(64) NOT NULL,
env VARCHAR(16) NOT NULL,
version VARCHAR(64) NOT NULL,
status ENUM('success','failed','rollback') NOT NULL,
start_time DATETIME NOT NULL,
end_time DATETIME NOT NULL,
approver VARCHAR(32),
window_hit TINYINT(1) DEFAULT 1,
gray_ratio DECIMAL(5,2) DEFAULT 0.00
);
SQL
2) Collect events (simulating callbacks from the release platform)
cat > /tmp/release_event.sql <<'SQL'
INSERT INTO release_events(app_name,env,version,status,start_time,end_time,approver,window_hit,gray_ratio)
VALUES
('order','prod','v1.2.3','success','2024-05-01 10:00:00','2024-05-01 10:15:00','alice',1,30.00),
('order','prod','v1.2.4','failed','2024-05-02 10:00:00','2024-05-02 10:10:00','bob',1,10.00),
('order','prod','v1.2.4','rollback','2024-05-02 10:12:00','2024-05-02 10:20:00','bob',1,10.00);
SQL
mysql -u root ops_metrics < /tmp/release_event.sql
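3) Compute core metrics (example SQL)
-- Release frequency (per day)
SELECT DATE(start_time) AS day, COUNT(*) AS releases
FROM release_events
GROUP BY DATE(start_time);
-- Change failure rate (every row counts as one release; the result is NULL when the table is empty)
SELECT
  SUM(status='failed')/COUNT(*) AS fail_rate
FROM release_events;
-- MTTR, approximated here by the average duration of rollback events, in minutes
SELECT
  AVG(TIMESTAMPDIFF(MINUTE,start_time,end_time)) AS mttr_min
FROM release_events
WHERE status='rollback';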
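To cross-check the SQL results without a database, the same three calculations can be sketched in plain Python over the three sample rows inserted in step 2. This is only an illustrative sketch: the function names are hypothetical, and it mirrors the queries as written (each row counts as one release, and MTTR is the average rollback duration).

```python
from collections import Counter
from datetime import datetime

# (status, start_time, end_time) for the three sample rows from step 2
ROWS = [
    ("success", "2024-05-01 10:00:00", "2024-05-01 10:15:00"),
    ("failed", "2024-05-02 10:00:00", "2024-05-02 10:10:00"),
    ("rollback", "2024-05-02 10:12:00", "2024-05-02 10:20:00"),
]
FMT = "%Y-%m-%d %H:%M:%S"


def releases_per_day(rows):
    """Count rows per calendar day of start_time."""
    return dict(Counter(start[:10] for _, start, _ in rows))


def fail_rate(rows):
    """Failed rows divided by all rows; 0.0 for an empty input."""
    if not rows:
        return 0.0
    return sum(1 for status, _, _ in rows if status == "failed") / len(rows)


def mttr_minutes(rows):
    """Average duration of rollback rows in minutes, or None if there are none."""
    durations = [
        (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60
        for status, start, end in rows
        if status == "rollback"
    ]
    return sum(durations) / len(durations) if durations else None
```

On the sample data this gives one release on 2024-05-01 and two on 2024-05-02, a failure rate of one third, and an MTTR of 8 minutes. Note that `fail_rate` returns 0.0 for empty input, whereas the SQL above returns NULL.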
Pushing metrics to Prometheus (Pushgateway example)
1) Install and start Pushgateway
curl -L -o /tmp/pushgateway.tar.gz https://github.com/prometheus/pushgateway/releases/download/v1.6.2/pushgateway-1.6.2.linux-amd64.tar.gz
# Writing to /opt normally requires root
sudo tar -xf /tmp/pushgateway.tar.gz -C /opt
/opt/pushgateway-1.6.2.linux-amd64/pushgateway &
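2) Compute and push the failure rate (example script)
sudo tee /usr/local/bin/push_release_metrics.sh >/dev/null <<'SH'
#!/usr/bin/env bash
# Abort on errors so that a failed query never pushes an empty value
set -euo pipefail
# IFNULL guards against an empty table, where the division yields NULL
FAIL_RATE=$(mysql -u root ops_metrics -N -e \
"SELECT IFNULL(SUM(status='failed')/COUNT(*),0) FROM release_events;")
cat <<EOF | curl -fsS --data-binary @- http://localhost:9091/metrics/job/release_metrics
# TYPE release_fail_rate gauge
release_fail_rate $FAIL_RATE
EOF
SH
sudo chmod +x /usr/local/bin/push_release_metrics.sh
/usr/local/bin/push_release_metrics.sh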
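When more than one metric has to be pushed, building the request body by hand in a heredoc gets error-prone. The sketch below shows one possible way to render several gauges in the Prometheus text exposition format and post them to Pushgateway using only the Python standard library. The function names `render_metrics` and `push_metrics` are hypothetical, and the URL and job name are simply the ones used in the script above.

```python
import urllib.request


def render_metrics(metrics):
    """Render a dict of gauge values in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The text format requires the last line to end with a newline.
    return "\n".join(lines) + "\n"


def push_metrics(metrics, base_url="http://localhost:9091", job="release_metrics"):
    """POST the rendered metrics to Pushgateway and return the HTTP status code."""
    body = render_metrics(metrics).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/metrics/job/{job}", data=body, method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

For example, `push_metrics({"release_fail_rate": 0.3333})` sends the same payload as the shell script, assuming Pushgateway is listening on localhost:9091.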
3) Prometheus scrape configuration (/etc/prometheus/prometheus.yml)
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true  # keep the job label that was attached at push time
    static_configs:
      - targets: ['localhost:9091']
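Troubleshooting checklist (with commands)
- Metric missing: check that Pushgateway is alive
  ss -lntp | grep 9091
  curl -s http://localhost:9091/metrics | head
- Conflicting definitions: verify the field sources and the time range
  DESCRIBE release_events;
  SELECT MIN(start_time), MAX(start_time) FROM release_events;
- Inconsistent clocks distorting period statistics: check NTP synchronization
  timedatectl status
- Prometheus not scraping: validate the configuration and reload (the reload endpoint only responds when Prometheus was started with --web.enable-lifecycle)
  promtool check config /etc/prometheus/prometheus.yml
  curl -X POST http://localhost:9090/-/reload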
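Continuous improvement example (applying PDCA)
- Problem found: approval time is high
- Root cause: approvals are concentrated on a few approvers and cannot run in parallel
- Action: introduce parallel approval and automated checks
- Verification: the approval-time metric drops (from 4h to 1h)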
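The verification step can be expressed as a number so that the improvement is comparable across cycles. A minimal sketch follows; `relative_reduction` is a hypothetical helper name, and the 4h and 1h figures are the ones from the example above.

```python
def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its earlier value."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Approval time went from 4 hours to 1 hour.
approval_time_reduction = relative_reduction(4, 1)
```

Here `approval_time_reduction` is 0.75, a 75% reduction in approval time.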
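Exercises
1) Extend the release_events table with a change_ticket column and write the ticket number into it.
2) Compute the release-window hit rate and the gray-release coverage, and push both to Pushgateway.
3) Use Grafana (or any other visualization tool) to chart the trends of release frequency and failure rate.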