12.8.5 Best Practices for Operations and Change Management


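Overview: This section covers the Keepalived change process, auditing, rollback, and failover drills. It provides actionable SOPs, command examples, and troubleshooting steps so that high-availability failovers stay controlled, auditable, and recoverable.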

[Figure: sketch of the change workflow and audit principles]

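1. Change Classification and Approval (with Example)

  • Levels: emergency / high / medium / low. VIP failover, configuration changes, and version upgrades are classified as high or emergency.
  • Example approval record (change request number plus ticketing system):
# Example change-request record (appended to the change log)
echo "$(date '+%F %T') CR-2024-0918 VIP failover; two-person review confirmed" | sudo tee -a /var/log/keepalived_change.log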
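
The audit entry above can be made more uniform with a small helper. The following is a minimal sketch rather than part of the original procedure: the function name `log_change` and the `CHANGE_LOG` variable are hypothetical, and both the `CR-YYYY-NNNN` numbering pattern and the field layout (timestamp, operator, change number, description) are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: append one audit line per change action.
# CHANGE_LOG defaults to the log path used above; override it for testing.
CHANGE_LOG="${CHANGE_LOG:-/var/log/keepalived_change.log}"

log_change() {
  local cr_id=$1
  shift
  # Reject anything that does not look like CR-YYYY-NNNN (assumed numbering scheme).
  case $cr_id in
    CR-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]) ;;
    *) echo "invalid change number: $cr_id" >&2; return 1 ;;
  esac
  # Fields: timestamp, operator, change number, free-text description.
  printf '%s user=%s %s %s\n' \
    "$(date '+%F %T')" "${SUDO_USER:-$(id -un)}" "$cr_id" "$*" >> "$CHANGE_LOG"
}
```

For example, `log_change CR-2024-0918 "VIP failover, two-person review confirmed"` appends a single line. Writing to `/var/log` requires a root shell, since a shell function cannot be prefixed with `sudo`.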

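2. Change Windows and Freeze Periods (with Check Command)

  • Window check: before making a change, check for business peak hours and for the freeze-period marker.
# Example: read the freeze-period marker (the file exists only while a freeze is in effect)
[ -f /etc/ops/change_freeze.flag ] && cat /etc/ops/change_freeze.flag && echo "Freeze period in effect: changes are prohibited"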
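
In a change script, the marker check is more useful as a gate that stops execution. A minimal sketch, assuming the same marker path as above; the function name `check_freeze` is hypothetical.

```shell
# Hypothetical pre-change gate: refuse to continue while the freeze flag exists.
# The default path matches the example above; pass another path as $1 to override.
check_freeze() {
  local flag=${1:-/etc/ops/change_freeze.flag}
  if [ -e "$flag" ]; then
    echo "Change freeze in effect ($flag); aborting." >&2
    return 1
  fi
  echo "No change freeze flag found; change may proceed."
}
```

Placing `check_freeze || exit 1` at the top of a change script stops it before any file is touched.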

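3. Standard Operating Procedure (SOP) with a Complete Step-by-Step Example

Pre-change checks (VIP, VRRP, health check)

# 1) Is the VIP present on this node?
ip addr show | grep -wF "10.10.10.100"

# 2) VRRP state: service status, plus the initial state (MASTER/BACKUP) set in the configuration
systemctl status keepalived --no-pager
grep -E "state MASTER|state BACKUP" /etc/keepalived/keepalived.conf

# 3) Is the health-check script executable?
test -x /etc/keepalived/check_http.sh && echo "Health-check script OK"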
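
A plain pattern match on the VIP can also match a longer address that merely starts with the same digits. The sketch below compares the address field exactly and reports the interface that holds the VIP. The function name `vip_interface` is hypothetical, and the field positions assume the one-line format printed by `ip -o -4 addr show`.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print the
# interface that holds the given VIP; returns 1 if no interface has it.
vip_interface() {
  local vip=$1
  awk -v vip="$vip" '
    # In one-line (-o) output, field 2 is the interface and field 4 is ADDR/PREFIX.
    { split($4, a, "/"); if (a[1] == vip) { print $2; found = 1 } }
    END { exit (found ? 0 : 1) }
  '
}
```

On a node, this would be used as `ip -o -4 addr show | vip_interface 10.10.10.100`.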

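During the change (configuration edit + syntax check + reload)

# 1) Back up the configuration (with a timestamp)
sudo cp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.bak.$(date +%F_%T)

# 2) Edit the configuration (example: lower the priority)
sudo sed -i 's/priority 120/priority 110/' /etc/keepalived/keepalived.conf

# 3) Check the configuration syntax (if the installed version supports -t)
sudo keepalived -t -f /etc/keepalived/keepalived.conf

# 4) Reload gracefully
sudo systemctl reload keepalived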
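
The steps above edit the live file before checking it. An alternative ordering validates a candidate file first and only then backs up and replaces the live configuration. This is a sketch under stated assumptions: `safe_apply` is a hypothetical name, and the validator is whatever command you pass in (for Keepalived, `keepalived -t -f` where supported).

```shell
# Hypothetical wrapper: validate a candidate config, then back up and install it.
# Usage: safe_apply CANDIDATE TARGET VALIDATOR [ARGS...]
# The validator is invoked as: VALIDATOR ARGS... CANDIDATE
safe_apply() {
  local candidate=$1 target=$2
  shift 2
  if ! "$@" "$candidate"; then
    echo "Validation failed; $target left unchanged." >&2
    return 1
  fi
  # Keep a timestamped copy of the current file before overwriting it.
  if [ -f "$target" ]; then
    cp -p "$target" "$target.bak.$(date +%F_%T)" || return 1
  fi
  cp "$candidate" "$target"
}
```

In a root shell this could be run as `safe_apply /tmp/keepalived.conf.new /etc/keepalived/keepalived.conf keepalived -t -f && systemctl reload keepalived`, where `/tmp/keepalived.conf.new` is an illustrative path for the edited copy.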

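Post-change verification (did the change take effect?)

# 1) Verify where the VIP ended up (confirm which node owns it)
ip addr show | grep -wF "10.10.10.100"

# 2) Review the VRRP log
sudo journalctl -u keepalived --since "5 min ago" | tail -n 50

# 3) Check that the service is reachable
curl -I http://10.10.10.100:80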

4. Configuration Management and Version Control (Git in Practice)

Put the Keepalived configuration under Git so that changes can be diffed and rolled back.

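# Initialize the repository (a dedicated configuration repository is recommended)
mkdir -p /opt/ops/keepalived && cd /opt/ops/keepalived
git init

# Add the configuration
cp /etc/keepalived/keepalived.conf .
git add keepalived.conf
git commit -m "init: keepalived baseline"

# Compare after a change (requires the change to be committed first)
git diff HEAD~1 HEAD

# Roll back to the previous committed version
git checkout HEAD~1 -- keepalived.conf
sudo cp keepalived.conf /etc/keepalived/keepalived.conf && sudo systemctl reload keepalived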

5. Automation and Repeatability (Deployment with Ansible)

# Example inventory
cat > /opt/ops/inventory.ini << 'EOF'
[keepalived]
10.10.10.11
10.10.10.12
EOF

# Push the configuration and reload (-b escalates privileges on the managed hosts)
ansible -i /opt/ops/inventory.ini keepalived -b -m copy \
  -a "src=/opt/ops/keepalived/keepalived.conf dest=/etc/keepalived/keepalived.conf owner=root group=root mode=0644"

ansible -i /opt/ops/inventory.ini keepalived -b -m service -a "name=keepalived state=reloaded"

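6. Failover Drills and Verification (with Fault Simulation)

Simulate a health-check failure on the master (triggers a failover)

# Stop the monitored service (nginx in this example)
sudo systemctl stop nginx

# Watch the VIP move
ip addr show | grep -wF "10.10.10.100"
sudo journalctl -u keepalived --since "2 min ago" | tail -n 20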
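
Failover is not instantaneous, so a single check right after stopping the service may run too early. A small polling helper makes the drill repeatable. This is a sketch; the name `wait_until` and the timeout values in the usage note are assumptions to be tuned to your advertisement interval.

```shell
# Hypothetical helper: retry a command until it succeeds or the timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${waited}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Condition met after ${waited}s"
}
```

On the backup node, `wait_until 10 1 sh -c 'ip addr show | grep -qwF 10.10.10.100'` would report how long the VIP took to arrive, which is a useful figure for the drill report.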

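7. Monitoring and Alerting Loop (Key Indicators and Commands)

  • Process: is Keepalived alive?
  • VRRP: MASTER/BACKUP state changes
  • VIP: is it reachable?
# Example self-check script (for collection by a monitoring agent)
sudo tee /usr/local/bin/keepalived_check.sh > /dev/null << 'EOF'
#!/bin/bash
# Exit 2 if the keepalived process is not running
pgrep -x keepalived >/dev/null || exit 2
# Report whether this node currently holds the VIP
ip addr show | grep -qwF "10.10.10.100" && echo "VIP OK" || echo "VIP MISSING"
EOF
sudo chmod +x /usr/local/bin/keepalived_check.sh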
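
The self-check above reports whether the VIP is bound, but not whether that matches the node's intended role: a missing VIP is normal on a backup and a problem on the master. One possible way to turn the result into an alert level is sketched below. The function name `evaluate_vip_state`, the severity mapping, and the Nagios-style exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are all assumptions, not something the original script defines.

```shell
# Hypothetical evaluation logic with Nagios-style exit codes.
# $1: role this node normally holds (MASTER or BACKUP)
# $2: "yes" if the VIP is bound locally, "no" otherwise
evaluate_vip_state() {
  local expected_role=$1 has_vip=$2
  case "$expected_role:$has_vip" in
    MASTER:yes|BACKUP:no)
      echo "OK: $expected_role, VIP bound=$has_vip"; return 0 ;;
    BACKUP:yes)
      echo "WARNING: backup node holds the VIP (a failover has occurred)"; return 1 ;;
    MASTER:no)
      echo "CRITICAL: master node does not hold the VIP"; return 2 ;;
    *)
      echo "UNKNOWN: unexpected input $expected_role/$has_vip"; return 3 ;;
  esac
}
```

A monitoring agent could call it with the node's configured role and the result of the VIP check, then alert on any non-zero exit code.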

8. Permissions and the Principle of Least Privilege (Example)

# Create a controlled operations group and restrict what it may run
sudo groupadd ops
sudo usermod -aG ops admin

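# Least-privilege sudoers entry (example; adjust the systemctl and cp paths to your distribution)
# Note: in sudoers, the trailing * also matches any further arguments, so tighten it where possible
echo "%ops ALL=(root) NOPASSWD: /bin/systemctl reload keepalived, /bin/cp /etc/keepalived/keepalived.conf*" | sudo tee /etc/sudoers.d/keepalived_ops
# Validate the new file and restrict its permissions
sudo visudo -cf /etc/sudoers.d/keepalived_ops && sudo chmod 0440 /etc/sudoers.d/keepalived_ops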

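9. Rollback Strategy (Explicit Conditions and Time Window)

Rollback conditions: the VIP does not move, the service is unreachable, or the VRRP state flaps repeatedly.
Rollback commands

# Quick rollback from the backup
sudo cp /etc/keepalived/keepalived.conf.bak.2024-09-18_10:30:00 /etc/keepalived/keepalived.conf
sudo systemctl reload keepalived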
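
Typing a timestamped file name by hand during an incident is error-prone. The sketch below picks the newest backup automatically. It assumes backups follow the `.bak.YYYY-MM-DD_HH:MM:SS` naming used above, which sorts chronologically as text; the function name `latest_backup` is hypothetical.

```shell
# Hypothetical helper: print the newest timestamped backup of a config file.
# Relies on the .bak.YYYY-MM-DD_HH:MM:SS suffix sorting chronologically as text.
latest_backup() {
  local conf=$1 newest="" f
  for f in "$conf".bak.*; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    newest=$f                 # glob results are sorted, so the last one is the newest
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

A rollback could then read `bak=$(latest_backup /etc/keepalived/keepalived.conf) && sudo cp "$bak" /etc/keepalived/keepalived.conf && sudo systemctl reload keepalived`.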

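10. Troubleshooting Checklist (Quick Diagnosis)

  1. VIP does not move: check priority, nopreempt, and the vrrp_script return value
  2. Configuration does not take effect: run keepalived -t to confirm that the syntax check passes
  3. VRRP advertisements not getting through: check the firewall and multicast
# Check firewall rules for VRRP traffic (multicast 224.0.0.18, IP protocol 112)
sudo iptables -L -n | grep -w 112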
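
Repeated VRRP flapping shows up in the log as a run of state transitions. A rough way to quantify it is to count them over a time window. This sketch assumes Keepalived's usual "Entering MASTER/BACKUP/FAULT STATE" log wording, which may differ between versions; the function name `count_transitions` is hypothetical.

```shell
# Hypothetical helper: count VRRP state transitions in log text read from stdin.
# Assumes log lines containing "Entering MASTER STATE", "Entering BACKUP STATE",
# or "Entering FAULT STATE".
count_transitions() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep the exit status clean.
  grep -cE 'Entering (MASTER|BACKUP|FAULT) STATE' || true
}
```

For example, `sudo journalctl -u keepalived --since "10 min ago" | count_transitions` prints the number of transitions in the last ten minutes. What count qualifies as flapping is a local judgment call.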

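11. Exercises and Drill Tasks

  1. Exercise 1: Change priority from 120 to 110, perform a graceful reload, and verify which node owns the VIP.
  2. Exercise 2: Stop nginx to trigger a failover, capture the logs, and write a drill report.
  3. Exercise 3: Use Git to roll back to the previous configuration version and verify the VRRP state.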