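19.4.8 Automated Inspection and Self-Healing Strategies
Automated inspection and self-healing standardize the "discover, diagnose, remediate, verify, close out" workflow so that faults are identified early and recovered from quickly. Inspection covers hosts, networks, storage, databases, middleware, and container platforms. It collects metrics, logs, configuration, and service state, compares them against baselines and trends, and outputs a risk score with remediation recommendations.
Schematic (inspection and self-healing closed loop): discover → diagnose → remediate → verify → close out.
Key inspection design points: define SLAs and key metrics (CPU, memory, disk, I/O, connection count, latency, error rate); establish configuration baselines (kernel parameters, system services, open ports, versions and patches); add application-level health probes (HTTP liveness, transaction liveness, read/write consistency); and integrate with the CMDB or asset inventory (tags, environment, business line, owner).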
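One way the key metrics could feed a risk score is a weighted sum of threshold breaches. The sketch below is only an illustration: the function names `risk_score` and `risk_level`, the thresholds (80% and 90% disk, load per CPU of 1.0), and the weights are all assumptions, not values prescribed by this section. It also assumes the caller has already divided the load average by the CPU count.

```bash
#!/usr/bin/env bash
# Hypothetical scoring: each breached threshold adds a fixed, assumed weight.
risk_score() {
  local disk_pct="$1" load_per_cpu="$2" ntp_sync="$3"
  local score=0
  # Disk usage: 80% and 90% are assumed warning and critical thresholds.
  if (( disk_pct >= 90 )); then
    score=$((score + 50))
  elif (( disk_pct >= 80 )); then
    score=$((score + 25))
  fi
  # Load per CPU is a decimal, so compare it with awk rather than bash integer math.
  if awk -v l="$load_per_cpu" 'BEGIN { exit !(l >= 1.0) }'; then
    score=$((score + 30))
  fi
  # An unsynchronized clock adds a smaller penalty.
  [[ "$ntp_sync" == "yes" ]] || score=$((score + 20))
  echo "$score"
}

# Map a numeric score to a coarse level (cut-offs are assumptions).
risk_level() {
  local score="$1"
  if (( score >= 70 )); then
    echo high
  elif (( score >= 30 )); then
    echo medium
  else
    echo low
  fi
}
```

For example, `risk_level "$(risk_score 92 0.4 yes)"` prints `medium`, because only the critical disk threshold is breached (score 50).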
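Self-healing follows two principles, tiered response and the ability to roll back:
- Light self-healing: process respawn, service restart, connection pool reset, cache clearing, temporary rate limiting and degradation.
- Moderate self-healing: instance decommission and rebuild, rolling restart, configuration rollback, automatic scale-out and scale-in.
- Heavy self-healing: primary/standby failover, traffic migration, data recovery, fault isolation and circuit breaking.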
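The three tiers above can be encoded as a small lookup that a dispatcher consults before acting. This is a minimal sketch: the action names and the approval policy (light actions always run, moderate ones only outside `prod`, heavy ones never without a human) are assumptions made for illustration, not rules stated in this section.

```bash
#!/usr/bin/env bash
# Map an action name to its tier; the action names are illustrative assumptions.
heal_tier() {
  case "$1" in
    restart_service|reset_pool|clear_cache|rate_limit) echo light ;;
    rebuild_instance|rolling_restart|config_rollback|autoscale) echo medium ;;
    failover|traffic_shift|data_restore|isolate) echo heavy ;;
    *) echo unknown ;;
  esac
}

# Decide whether an action may run unattended in the given environment.
# Assumed policy: light always; medium outside prod; heavy and unknown never.
can_auto_run() {
  local action="$1" env="$2" tier
  tier=$(heal_tier "$action")
  case "$tier" in
    light) return 0 ;;
    medium) [[ "$env" != "prod" ]] ;;
    *) return 1 ;;
  esac
}
```

A caller might write `can_auto_run rolling_restart staging && run_action ...`, where `run_action` stands for whatever executor the platform provides. Rejecting unknown actions by default keeps a typo from triggering a heavy operation.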
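Environment preparation and installation example (Prometheus + Alertmanager + Node Exporter)
Provides the basic metric collection and alert triggering that inspection relies on and that later drives self-healing. The commands below install Node Exporter and Prometheus; Alertmanager setup is not shown.
# Install Node Exporter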
useradd -r -s /sbin/nologin node_exporter
cd /opt
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar -xf node_exporter-1.6.1.linux-amd64.tar.gz
mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
cat >/etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now node_exporter
# Install Prometheus (simplified example)
useradd -r -s /sbin/nologin prometheus
cd /opt
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.48.1/prometheus-2.48.1.linux-amd64.tar.gz
tar -xf prometheus-2.48.1.linux-amd64.tar.gz
mv prometheus-2.48.1.linux-amd64 /etc/prometheus
cat >/etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100']
EOF
cat >/etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
After=network.target
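[Service]
User=prometheus
# The default TSDB path is ./data, which this user cannot create under /; StateDirectory provides /var/lib/prometheus.
StateDirectory=prometheus
ExecStart=/etc/prometheus/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus
Restart=always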
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now prometheus
Verify that scraping works:
curl -s http://127.0.0.1:9100/metrics | head
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[].health'
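Automated inspection example (host and middleware)
The inspection script emits JSON so that an orchestration system (CI/CD or a task scheduler) can consume it.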
cat >/usr/local/bin/health_check.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
ts=$(date +%F_%T)
host=$(hostname)
load=$(awk '{print $1}' /proc/loadavg)
disk=$(df -P / | awk 'NR==2{print $5}' | tr -d '%')
# Fall back to "unknown" so that a missing timedatectl does not abort the script under set -e
ntp=$(timedatectl 2>/dev/null | awk -F': ' '/System clock synchronized/{print $2}' || true)
ntp=${ntp:-unknown}
# MySQL health (requires the mysql client)
mysql_ok="unknown"
if command -v mysql >/dev/null; then
  mysql -h127.0.0.1 -uroot -e "select 1" >/dev/null 2>&1 && mysql_ok="ok" || mysql_ok="fail"
fi
# Nginx health (-x matches the status code exactly)
nginx_ok="unknown"
if command -v curl >/dev/null; then
  curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1/ | grep -qx 200 && nginx_ok="ok" || nginx_ok="fail"
fi
cat <<JSON
{
  "ts":"$ts",
  "host":"$host",
  "load":"$load",
  "disk_root_percent":"$disk",
  "ntp_sync":"$ntp",
  "mysql":"$mysql_ok",
  "nginx":"$nginx_ok"
}
JSON
EOF
chmod +x /usr/local/bin/health_check.sh
# Run it
/usr/local/bin/health_check.sh
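A consumer of this JSON needs to know which checks failed before it can pick a self-healing action. The sketch below extracts the names of fields whose value is "fail" using only grep and cut. The function name `failed_checks` is hypothetical, and the pattern assumes the flat string-valued JSON that the script above prints; a real JSON parser such as jq would be sturdier for anything more nested.

```bash
#!/usr/bin/env bash
# Read the inspection JSON on stdin and print the name of every field
# whose value is exactly "fail", one name per line.
failed_checks() {
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # failing callers that run with "set -o pipefail".
  { grep -oE '"[A-Za-z_]+"[[:space:]]*:[[:space:]]*"fail"' || true; } | cut -d'"' -f2
}
```

Typical use would be `/usr/local/bin/health_check.sh | failed_checks`, looping over the output lines to dispatch one action per failed check.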
Self-healing action library example (systemd + rollback)
cat >/usr/local/bin/self_heal.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Fail with a usage message when the service name is missing
service="${1:?usage: self_heal.sh <service> [action]}"
action="${2:-restart}"
log=/var/log/self_heal.log
echo "$(date +%F_%T) action=$action service=$service" >>"$log"
# Idempotent: act only when the service is not running
if ! systemctl is-active --quiet "$service"; then
  systemctl "$action" "$service"
  sleep 2
  # Verify; if the service is still down, stop it and report failure
  systemctl is-active --quiet "$service" || {
    echo "$(date +%F_%T) fail service=$service rollback=stop" >>"$log"
    systemctl stop "$service"
    exit 1
  }
fi
EOF
chmod +x /usr/local/bin/self_heal.sh
# Trigger: restart nginx
/usr/local/bin/self_heal.sh nginx restart
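The script above verifies recovery after a fixed two-second sleep, which can be too short for slow-starting services. A polling helper is one alternative; the sketch below is an assumption-laden illustration (the name `wait_until` and its argument order are invented here) that retries any check command a bounded number of times.

```bash
#!/usr/bin/env bash
# wait_until ATTEMPTS INTERVAL CMD [ARGS...]
# Run CMD until it succeeds or ATTEMPTS is exhausted, sleeping INTERVAL
# seconds between tries. Returns 0 on success, 1 otherwise.
wait_until() {
  local attempts="$1" interval="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    (( i < attempts )) && sleep "$interval"
  done
  return 1
}
```

Inside a self-healing script this could replace the fixed sleep, for example `wait_until 5 2 systemctl is-active --quiet "$service"`, giving the service up to roughly eight seconds to come up before the rollback path runs.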
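Task orchestration example (cron with result reporting)
# Inspect every 5 minutes and report to an API (placeholder address).
# cron does not support backslash line continuation, so the entry must stay on one line.
cat >/etc/cron.d/auto_inspect <<'EOF'
*/5 * * * * root /usr/local/bin/health_check.sh | curl -s -XPOST -H 'Content-Type: application/json' -d @- http://ops.example.com/api/inspect
EOF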
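Container and K8s inspection example
# Docker container status inspection
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'
# K8s node and Pod health (the phase filter also lists completed Pods in the Succeeded phase)
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running
# Health of key control-plane components
kubectl -n kube-system get pods -l component=kube-apiserver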
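Pods stuck in CrashLoopBackOff often still report the phase Running, so a phase-based field selector can miss them. A possible complement is to filter on the STATUS column of the default table output. The sketch below assumes the standard `kubectl get pods -A` column order (NAMESPACE, NAME, READY, STATUS, ...); the function name `crashloop_pods` is hypothetical.

```bash
#!/usr/bin/env bash
# Read "kubectl get pods -A" table output on stdin and print
# namespace/name for every Pod whose STATUS column is CrashLoopBackOff.
crashloop_pods() {
  # NR > 1 skips the header row; column 4 is STATUS in the default layout.
  awk 'NR > 1 && $4 == "CrashLoopBackOff" { print $1 "/" $2 }'
}
```

It would be used as `kubectl get pods -A | crashloop_pods`. Parsing table output is fragile across custom column layouts, so treat this as a quick inspection aid rather than a stable interface.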
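Configuration baseline and drift detection example
# Capture the system baseline
mkdir -p /var/lib/inspect
sysctl -a > /var/lib/inspect/baseline_sysctl.txt 2>/dev/null
ss -lntp > /var/lib/inspect/baseline_ports.txt
rpm -qa | sort > /var/lib/inspect/baseline_pkgs.txt
# Later, capture the current state and compare it with the baseline (example)
sysctl -a > /var/lib/inspect/current_sysctl.txt 2>/dev/null
diff -u /var/lib/inspect/baseline_sysctl.txt /var/lib/inspect/current_sysctl.txt || true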
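Common troubleshooting (examples)
1) Prometheus scrape failures:
# Check target status
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[]|{discoveredLabels,health,lastError}'
# Common causes: node_exporter is not running or is blocked by a firewall
systemctl status node_exporter
ss -lntp | grep 9100
2) Self-healing script has no effect:
# Check that the service exists and that permissions are correct (the script acts only when the service is not active)
systemctl status nginx
ls -l /usr/local/bin/self_heal.sh
journalctl -u nginx -n 50
3) Inspection script prints nothing:
# Confirm the required commands are present
command -v curl mysql awk
# Run in debug mode
bash -x /usr/local/bin/health_check.sh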
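The dependency check in item 3 prints the paths of commands it finds but does not say which ones are missing. A small helper can make that explicit. This is a sketch with a hypothetical function name, `missing_deps`, not part of any script shown earlier.

```bash
#!/usr/bin/env bash
# missing_deps CMD...
# Print each command that is not on PATH; return 1 if any is missing.
missing_deps() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `missing_deps curl mysql awk` at the top of an inspection run would list the absent tools, letting the caller skip the related checks or stop early with a clear message.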
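Exercises
1) Add inspection items for Redis: check used_memory and rdb_last_bgsave_status, and trigger self_heal.sh redis when either is abnormal.
2) Add certificate expiry inspection for Nginx: read the number of days remaining on the certificate and output a risk level when fewer than 7 days remain.
3) Implement "multiple alerts from a single point trigger only one remediation": use a file lock or Redis SETNX for idempotency.
4) Add a self-healing action for K8s Pods in CrashLoopBackOff: run kubectl rollout restart only in non-production namespaces.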
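Key commands explained
- systemctl is-active --quiet: reports the service's running state through its exit code; used for the idempotency check.
- curl -s -o /dev/null -w "%{http_code}": prints only the HTTP status code, which is convenient for liveness probing.
- kubectl get pods -A --field-selector=status.phase!=Running: quickly filters Pods that are not in the Running phase.
- diff -u: prints the differences from the baseline for audit and remediation.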
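Feeding inspection results into closed-loop remediation, policy iteration, knowledge-base buildup, and SOP updates steadily reduces MTTR and improves stability.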