5.5.5 Observability and Alert Integration for Production Scripts
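A production script must do more than run: it must be observable. This section builds the smallest useful observability loop (structured logs, metric reporting, and alert notification) and provides runnable examples, installation steps, troubleshooting, and exercises.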
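I. Concept sketch: the script observability loop
The script writes structured logs and an audit record on every run and pushes metrics to Pushgateway; Prometheus scrapes Pushgateway and evaluates alert rules; Alertmanager routes firing alerts to a webhook.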
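II. Minimal implementation: structured log template + audit
- Goal: use a fixed set of fields so logs are easy to search and trace, and keep an audit record of every run.
Example script: /opt/scripts/backup.sh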
#!/usr/bin/env bash
set -euo pipefail
SCRIPT="backup"
VERSION="1.0.0"
LOG_DIR="/var/log/ops"
LOG_FILE="${LOG_DIR}/${SCRIPT}.log"
AUDIT_FILE="/var/log/ops/audit.log"
TRACE_ID="${TRACE_ID:-$(date +%s)-$$}"
mkdir -p "$LOG_DIR"
# Append one JSON log line: log_json LEVEL STAGE MSG [CODE]
log_json() {
  local level="$1" stage="$2" msg="$3" code="${4:-0}"
  printf '{"ts":"%s","script":"%s","ver":"%s","trace_id":"%s","level":"%s","stage":"%s","code":%s,"msg":"%s"}\n' \
    "$(date -Iseconds)" "$SCRIPT" "$VERSION" "$TRACE_ID" "$level" "$stage" "$code" "$msg" >> "$LOG_FILE"
}
# Append one audit record: audit [TRIGGER] [STATUS] [CODE]
audit() {
  printf '%s|%s|%s|%s|%s|%s\n' \
    "$(date -Iseconds)" "$SCRIPT" "$TRACE_ID" "${1:-cron}" "${2:-ok}" "${3:-0}" >> "$AUDIT_FILE"
}
# Log and audit any unexpected command failure before set -e aborts the script
trap 'log_json "ERROR" "trap" "unexpected error" 1; audit "unknown" "fail" 1' ERR
start_ts=$(date +%s)
log_json "INFO" "start" "backup start"
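# Check critical dependencies
if ! nc -z 127.0.0.1 3306; then
  log_json "ERROR" "check" "mysql not reachable" 2
  audit "cron" "fail" 2
  exit 2
fi
# Business operation
log_json "INFO" "exec" "run mysqldump"
mkdir -p /data/backup
# Demo only: a password on the command line is visible in the process list; prefer a credentials file such as ~/.my.cnf.
mysqldump -uroot -p'123456' --all-databases > /data/backup/all.sql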
end_ts=$(date +%s)
duration=$((end_ts - start_ts))
log_json "INFO" "end" "backup success duration=${duration}s"
audit "cron" "ok" 0
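Command notes
- set -euo pipefail: exit on the first failing command, treat unset variables as errors, and fail a pipeline if any command in it fails.
- nc -z 127.0.0.1 3306: check that the MySQL port accepts connections.
- mysqldump ... > /data/backup/all.sql: write the database dump to the backup file.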
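The log_json function above interpolates the message directly into a JSON string, so a message containing a double quote or a backslash would produce a line that JSON parsers reject. The following is a minimal sketch of an escaping helper, not part of the original script: the name json_escape is hypothetical, and it only covers backslash, double quote, newline, carriage return, and tab.

```bash
#!/usr/bin/env bash
set -euo pipefail

# json_escape (hypothetical helper): escape a string so it can be placed
# inside a JSON string literal. Other control characters are not handled.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}      # double quote
  s=${s//$'\n'/\\n}    # newline
  s=${s//$'\r'/\\r}    # carriage return
  s=${s//$'\t'/\\t}    # tab
  printf '%s' "$s"
}
```

Inside log_json, the message could then be assigned as msg="$(json_escape "$3")" before the printf call.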
Verification (tail -f blocks, so run each tail in its own terminal)
bash /opt/scripts/backup.sh
tail -f /var/log/ops/backup.log
tail -f /var/log/ops/audit.log
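Because the audit file uses a fixed pipe-delimited layout (timestamp, script, trace_id, trigger, status, exit code), it can be summarized with a few lines of awk. The sketch below is one possible way to count total and failed runs per script; the function name audit_summary is hypothetical, and it assumes every line follows the six-field layout written by audit().

```bash
#!/usr/bin/env bash
set -euo pipefail

# audit_summary (hypothetical helper): print "<script> total=N fail=M"
# for each script found in the audit file. Field 2 is the script name and
# field 5 is the status ("ok" or anything else, which is counted as a failure).
audit_summary() {
  local file="${1:-/var/log/ops/audit.log}"
  awk -F'|' '
    { total[$2]++; if ($5 != "ok") fail[$2]++ }
    END { for (s in total) printf "%s total=%d fail=%d\n", s, total[s], fail[s] + 0 }
  ' "$file" | sort
}
```

Running audit_summary with no argument reads /var/log/ops/audit.log; pass another path to summarize a rotated or copied file.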
III. Metric reporting: Pushgateway + Prometheus
- Goal: quantify run outcome, duration, and retries with metrics.
Install Pushgateway (binary release as an example)
useradd -r -s /sbin/nologin pushgw
cd /usr/local/src
curl -L -O https://github.com/prometheus/pushgateway/releases/download/v1.6.2/pushgateway-1.6.2.linux-amd64.tar.gz
tar xf pushgateway-1.6.2.linux-amd64.tar.gz
mv pushgateway-1.6.2.linux-amd64 /usr/local/pushgateway
chown -R pushgw:pushgw /usr/local/pushgateway
cat >/etc/systemd/system/pushgateway.service <<'EOF'
[Unit]
Description=Prometheus Pushgateway
After=network.target
[Service]
User=pushgw
ExecStart=/usr/local/pushgateway/pushgateway
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now pushgateway
Report metrics from the script (append to backup.sh)
PUSHGW="http://127.0.0.1:9091"
JOB="backup"
push_metrics() {
  local success="$1" duration="$2" retries="$3"
  # The job label comes from the /metrics/job/<name> path, so it is not repeated in the body.
  cat <<EOF | curl --data-binary @- "${PUSHGW}/metrics/job/${JOB}"
script_job_success ${success}
script_job_duration_seconds ${duration}
script_job_retries_total ${retries}
EOF
}
# Call at the success and failure points
# Success
push_metrics 1 "$duration" 0
# Failure (example)
# push_metrics 0 "$duration" 0
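With set -e, a failing command aborts the script before any later push_metrics call is reached, so the failure case is easy to miss. One way to cover every exit path is an EXIT trap that derives success and duration from the exit code. This is a sketch under stated assumptions, not the original script's method: render_metrics, report_exit, and the DRY_RUN switch are hypothetical names, and the retries metric is left out for brevity.

```bash
#!/usr/bin/env bash
set -euo pipefail

PUSHGW="${PUSHGW:-http://127.0.0.1:9091}"
JOB="${JOB:-backup}"
start_ts=$(date +%s)

# render_metrics (hypothetical): print the text-format payload for one run.
render_metrics() {
  local success="$1" duration="$2"
  printf 'script_job_success %s\n' "$success"
  printf 'script_job_duration_seconds %s\n' "$duration"
}

# report_exit (hypothetical): runs on every exit path, including set -e aborts
# and explicit "exit N" calls, and preserves the original exit code.
report_exit() {
  local code=$?
  local success=0
  if [ "$code" -eq 0 ]; then success=1; fi
  local duration=$(( $(date +%s) - start_ts ))
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry run: print the payload instead of pushing it.
    render_metrics "$success" "$duration"
  else
    # A failed push must not mask the script's own exit code.
    render_metrics "$success" "$duration" |
      curl -sS --max-time 5 --data-binary @- "${PUSHGW}/metrics/job/${JOB}" || true
  fi
  exit "$code"
}
trap report_exit EXIT
```

The EXIT trap can coexist with the ERR trap used for logging: ERR fires first on a failing command, then EXIT reports the final outcome.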
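Prometheus scrape configuration (prometheus.yml)
scrape_configs:
  - job_name: 'pushgateway'
    # Keep the pushed job label (job="backup") instead of renaming it to exported_job
    honor_labels: true
    static_configs:
      - targets: ['127.0.0.1:9091']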
Verification
curl -s http://127.0.0.1:9091/metrics | grep script_job
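For scripted checks, grep output is awkward to compare against an expected value. The sketch below extracts the value of a single metric for a given job from text-format output read on stdin. The function name metric_value is hypothetical, and the matching is deliberately simple: it looks for the substring job="<name>" on a line that starts with the metric name and an opening brace.

```bash
#!/usr/bin/env bash
set -euo pipefail

# metric_value (hypothetical helper): read Prometheus text-format lines on
# stdin and print the value of the first sample of NAME whose label set
# contains job="JOB". Prints nothing if no sample matches.
metric_value() {
  local name="$1" job="$2"
  awk -v n="$name" -v j="job=\"$job\"" '
    !found && index($0, n "{") == 1 && index($0, j) > 0 { print $NF; found = 1 }
  '
}
```

A typical use would be curl -s http://127.0.0.1:9091/metrics | metric_value script_job_success backup, comparing the result with 1 after a successful run.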
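IV. Alert integration: Alertmanager and webhook
- Goal: send a notification on failure or timeout that includes the job, stage, and trace_id.
Alerting rule example (a Prometheus rule file; firing alerts are forwarded to Alertmanager)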
groups:
  - name: script_alerts
    rules:
      - alert: ScriptJobFailed
        expr: script_job_success{job="backup"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "backup job failed"
          description: "backup failed, check logs: /var/log/ops/backup.log"
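The rule above covers failure only, while the stated goal also mentions timeouts. A duration threshold on script_job_duration_seconds is one way to approximate that. The sketch below writes such a rule to a file and validates it with promtool when it is installed. The alert name ScriptJobSlow, the 3600-second threshold, and the RULE_FILE variable are assumptions chosen for illustration, not values from the original setup.

```bash
#!/usr/bin/env bash
set -euo pipefail

# RULE_FILE is an assumed location; a temporary file is used by default.
RULE_FILE="${RULE_FILE:-$(mktemp)}"

# Write a hypothetical "slow run" rule. The 3600s threshold is an example value.
cat >"$RULE_FILE" <<'EOF'
groups:
  - name: script_duration_alerts
    rules:
      - alert: ScriptJobSlow
        expr: script_job_duration_seconds{job="backup"} > 3600
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "backup job ran longer than expected"
EOF

# Validate the rule syntax if promtool is available; otherwise just report it.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULE_FILE"
else
  echo "promtool not found; skipping validation of $RULE_FILE" >&2
fi
```

This only detects a run that finished slowly, because the duration is pushed at the end of the script; a run that hangs forever would need a separate check.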
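Webhook receiver (simple example)
# Minimal HTTP listener (demo only). http.server has no POST handler, so it logs each
# Alertmanager request and answers 501; use a real receiver to acknowledge notifications.
python3 -m http.server 8081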
Alertmanager configuration snippet
receivers:
  - name: webhook-demo
    webhook_configs:
      - url: "http://127.0.0.1:8081"
route:
  receiver: webhook-demo
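V. Troubleshooting and debugging checklist
1. No metrics in Pushgateway
- Check the service: systemctl status pushgateway
- View the metrics: curl -s http://127.0.0.1:9091/metrics
- Confirm the script's curl --data-binary push succeeded (HTTP status 200)
2. Prometheus is not scraping
- Check the targets page: http://prometheus:9090/targets
- Confirm the address and port in prometheus.yml match the Pushgateway listener
3. Alerts do not fire
- Run the expression script_job_success in the Prometheus query UI
- Follow the Alertmanager logs: journalctl -u alertmanager -f
4. Logs are not written to disk
- Check LOG_DIR permissions and disk space: df -h, ls -ld /var/log/ops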
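When alerts do not arrive, it helps to test the Alertmanager-to-webhook leg separately from Prometheus by posting a synthetic alert to Alertmanager's v2 API. The sketch below builds such a payload and sends it. The function names are hypothetical, and ALERTMANAGER_URL with port 9093 is an assumption about the local setup.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumed Alertmanager address; adjust to the actual deployment.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://127.0.0.1:9093}"

# test_alert_payload (hypothetical): print a one-element alert array that
# mimics the ScriptJobFailed alert for the given job.
test_alert_payload() {
  local job="${1:-backup}"
  printf '[{"labels":{"alertname":"ScriptJobFailed","job":"%s","severity":"critical"},"annotations":{"summary":"synthetic test alert"}}]\n' "$job"
}

# send_test_alert (hypothetical): post the payload to /api/v2/alerts.
send_test_alert() {
  test_alert_payload "$@" |
    curl -sS --max-time 5 -H 'Content-Type: application/json' \
      --data-binary @- "${ALERTMANAGER_URL}/api/v2/alerts"
}
```

If send_test_alert backup reaches the webhook receiver but real failures do not, the problem is more likely in the Prometheus rule or scrape configuration than in Alertmanager routing.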
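VI. Hands-on exercises
1. Exercise 1: force a failure
- Stop MySQL, run the script, and watch how the logs and metrics change.
- Commands (the service may be named mysqld or mariadb on some distributions):
systemctl stop mysql
bash /opt/scripts/backup.sh || true
tail -n 5 /var/log/ops/backup.log
curl -s http://127.0.0.1:9091/metrics | grep script_job
2. Exercise 2: add a per-stage duration metric
- Add a dump_duration_seconds metric that records how long mysqldump takes.
3. Exercise 3: alert deduplication and noise reduction
- Change for: 1m to for: 5m in the alert rule and observe the added delay and how short-lived failures are suppressed.
4. Exercise 4: log search
- Use jq or grep to find the failing stage:
grep '"level":"ERROR"' /var/log/ops/backup.log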
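As a possible starting point for Exercise 2, a small wrapper can time any single command using bash's built-in SECONDS counter. This is only a sketch: time_stage and LAST_STAGE_SECONDS are hypothetical names, and the resolution is whole seconds.

```bash
#!/usr/bin/env bash
set -euo pipefail

LAST_STAGE_SECONDS=0

# time_stage (hypothetical helper): run the given command, store its elapsed
# whole seconds in LAST_STAGE_SECONDS, and return the command's exit code.
time_stage() {
  local t0=$SECONDS rc=0
  "$@" || rc=$?
  LAST_STAGE_SECONDS=$(( SECONDS - t0 ))
  return "$rc"
}
```

For example, time_stage mysqldump --all-databases > /data/backup/all.sql would leave the dump time in LAST_STAGE_SECONDS, which could then be added to the pushed payload as dump_duration_seconds.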
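VII. Standardization recommendations
- Turn observability into a script template: fixed log fields, fixed metrics, and version-controlled alert rules.
- Instrument the critical path first: dependency checks, slow steps, and exit codes.
- Rehearse the alert pipeline regularly to confirm that notifications are delivered.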