12.4.3 Notify Scripts and the Failure-Handling Workflow

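Keepalived runs notify scripts whenever a VRRP instance changes state. They are used for VIP takeover, service start-up, alerting, and logging. A notify script must be fast, idempotent, and re-entrant, and its actions must stay consistent with the health-check results. This section covers a sketch of the principle, installation and directory preparation, a configuration example, script examples, the failure-handling workflow, and troubleshooting steps.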
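Re-entrancy matters because two state changes can arrive close together and start the same script twice. The following is a minimal sketch of one way to serialize notify actions, using the atomicity of `mkdir` as a lock. The names `LOCK_DIR`, `acquire_lock`, `release_lock`, and `run_exclusive` are hypothetical helpers for illustration and are not part of Keepalived.

```shell
#!/bin/bash
# Sketch: serialize a notify action with an atomic mkdir-based lock.
# LOCK_DIR is an assumed path chosen for this example.
LOCK_DIR="${LOCK_DIR:-/tmp/keepalived-notify.lock}"

acquire_lock() {
  # mkdir either creates the directory or fails; only one caller can win.
  mkdir "$LOCK_DIR" 2>/dev/null
}

release_lock() {
  rmdir "$LOCK_DIR" 2>/dev/null
}

run_exclusive() {
  # Run the given command only if no other invocation holds the lock.
  if ! acquire_lock; then
    echo "another notify action is running, skipping" >&2
    return 1
  fi
  local rc=0
  "$@" || rc=$?
  release_lock
  return "$rc"
}
```

A notify script could then wrap its work as `run_exclusive systemctl start nginx`. A lock left behind by a killed script has to be cleaned up separately, which this sketch does not attempt.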

Principle Sketch (State Transitions Trigger Scripts)

[Figure: state transitions trigger the notify scripts]

Installation and Directory Preparation

# Example for RHEL/CentOS
sudo yum install -y keepalived

# Directory for notify scripts
sudo mkdir -p /etc/keepalived/scripts
sudo chown root:root /etc/keepalived/scripts
sudo chmod 750 /etc/keepalived/scripts

# Log directory
sudo mkdir -p /var/log/keepalived
sudo chown root:root /var/log/keepalived

Keepalived Configuration Example (with Notify Scripts)

# /etc/keepalived/keepalived.conf
global_defs {
  router_id LVS_NODE_1
  script_user root
  enable_script_security
}

vrrp_script chk_nginx {
  script "/bin/sh /etc/keepalived/scripts/check_nginx.sh"
  interval 2   # run the check every 2 seconds
  timeout 2    # a run longer than 2 seconds counts as a failure
  fall 3       # 3 consecutive failures mark the check as failed
  rise 2       # 2 consecutive successes mark it as healthy again
}

vrrp_instance VI_1 {
  state BACKUP
  interface eth0
  virtual_router_id 51
  priority 100
  advert_int 1
  # Use either nopreempt or preempt_delay, not both:
  # preempt_delay has no effect while nopreempt is set.
  nopreempt
  # preempt_delay 5

  authentication {
    auth_type PASS
    auth_pass 1111
  }

  virtual_ipaddress {
    10.0.0.100/24 dev eth0
  }

  track_script {
    chk_nginx
  }

  notify_master "/etc/keepalived/scripts/notify_master.sh VI_1 MASTER"
  notify_backup "/etc/keepalived/scripts/notify_backup.sh VI_1 BACKUP"
  notify_fault  "/etc/keepalived/scripts/notify_fault.sh VI_1 FAULT"
}
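The timers in this configuration determine how quickly a failure is noticed. As a rough worked example, the sketch below computes two figures: the approximate time for the health check to fail (`interval` × `fall`), and the VRRP master-down interval (3 × `advert_int` plus a skew time of (256 − priority)/256 × `advert_int`, as defined by the VRRP specification). The function names are hypothetical, and the results ignore script run time and scheduling delays.

```shell
#!/bin/bash
# Sketch: estimate failure-detection times from keepalived.conf values.

# Seconds until a backup declares the master down:
#   3 * advert_int + skew, where skew = (256 - priority) / 256 * advert_int
master_down_interval() {
  LC_ALL=C awk -v a="$1" -v p="$2" \
    'BEGIN { printf "%.3f\n", 3 * a + (256 - p) / 256 * a }'
}

# Approximate seconds until a tracked script is marked failed:
#   interval * fall
check_detect_seconds() {
  echo $(( $1 * $2 ))
}

master_down_interval 1 100   # advert_int 1, priority 100 -> 3.609
check_detect_seconds 2 3     # interval 2, fall 3 -> 6
```

With the values above, a crashed master is detected by a priority-100 backup after about 3.6 seconds, while a stopped nginx on a live master takes about 6 seconds to register.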

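Health-Check Script Example (Commands Explained)

#!/bin/bash
# /etc/keepalived/scripts/check_nginx.sh
# Check whether anything is listening on TCP port 80. A non-zero exit
# status counts as a failed check and, after `fall` failures, triggers demotion.
# The trailing space in the pattern keeps ":8080" from matching ":80".
ss -lnt | grep -q ':80 ' && exit 0
exit 1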
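A plain `grep` on the port number is easy to get wrong, because `:80` is also a substring of `:8080`. As an alternative, the sketch below compares the port against the end of the local-address column of `ss -lnt` output. The helper name `port_in_ss_output` is hypothetical, and it assumes the usual `ss` layout in which the fourth column is `Local Address:Port`.

```shell
#!/bin/bash
# Sketch: exact port match against `ss -lnt` output read from stdin.
# Returns 0 if some line has a local address ending in ":<port>".
port_in_ss_output() {
  awk -v p=":$1" '
    {
      # $4 is "Local Address:Port", e.g. 0.0.0.0:80 or [::]:80
      start = length($4) - length(p) + 1
      if (start >= 1 && substr($4, start) == p) found = 1
    }
    END { exit(found ? 0 : 1) }'
}
```

In a check script this would be used as `ss -lnt | port_in_ss_output 80`, with the exit status passed straight back to Keepalived.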

Notify Script Examples (Idempotent, with Logging and Alerting)

#!/bin/bash
# /etc/keepalived/scripts/notify_master.sh
INSTANCE="$1"; STATE="$2"
VIP="10.0.0.100"
LOG="/var/log/keepalived/notify.log"

echo "$(date '+%F %T') $INSTANCE $STATE begin" >> "$LOG"

# Bind the VIP only if it is missing (idempotent). Keepalived normally adds
# the addresses listed in virtual_ipaddress itself, so this is a safeguard.
if ! ip addr show dev eth0 | grep -qF " ${VIP}/"; then
  ip addr add "${VIP}/24" dev eth0
fi

# Start the service (example: nginx)
systemctl start nginx
systemctl is-active --quiet nginx || echo "$(date '+%F %T') nginx start failed" >> "$LOG"

# Alert (example: local logger)
logger -t keepalived "[$INSTANCE] to MASTER, VIP $VIP"

echo "$(date '+%F %T') $INSTANCE $STATE done" >> "$LOG"
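
#!/bin/bash
# /etc/keepalived/scripts/notify_backup.sh
INSTANCE="$1"; STATE="$2"
VIP="10.0.0.100"
LOG="/var/log/keepalived/notify.log"

echo "$(date '+%F %T') $INSTANCE $STATE begin" >> "$LOG"

# Unbind the VIP; errors are ignored because the address may already be gone
ip addr del "${VIP}/24" dev eth0 2>/dev/null

# Degrade the service (example: stop a write service).
# Caution: chk_nginx watches port 80, so stopping nginx here makes that
# check fail on this node. Stop only services the health check does not track.
systemctl stop nginx

logger -t keepalived "[$INSTANCE] to BACKUP, VIP removed"

echo "$(date '+%F %T') $INSTANCE $STATE done" >> "$LOG"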

#!/bin/bash
# /etc/keepalived/scripts/notify_fault.sh
INSTANCE="$1"; STATE="$2"
LOG="/var/log/keepalived/notify.log"

echo "$(date '+%F %T') $INSTANCE $STATE begin" >> "$LOG"

# Stop the service and record the event
systemctl stop nginx
logger -t keepalived "[$INSTANCE] to FAULT, service stopped"

echo "$(date '+%F %T') $INSTANCE $STATE done" >> "$LOG"

# Make the scripts executable
sudo chmod +x /etc/keepalived/scripts/*.sh
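The three scripts above repeat most of their logic. As an alternative, Keepalived also offers a generic `notify` hook that calls one script for every transition and appends the arguments TYPE, NAME, STATE, and (in recent versions) PRIORITY. The sketch below shows how such a dispatcher might look. The action names it prints are placeholders, and the functions `action_for_state` and `main` are hypothetical.

```shell
#!/bin/bash
# Sketch: one dispatcher for Keepalived's generic `notify` hook.
# Assumed arguments appended by Keepalived: TYPE NAME STATE PRIORITY.

action_for_state() {
  # Map a target state to a placeholder action name.
  case "$1" in
    MASTER) echo "start-service" ;;
    BACKUP) echo "standby" ;;
    FAULT)  echo "alert" ;;
    *)      echo "ignore" ;;
  esac
}

main() {
  local type="$1" name="$2" state="$3" priority="$4"
  logger -t keepalived \
    "[$name] $type -> $state (priority $priority): $(action_for_state "$state")"
}

# Run only when Keepalived supplied its arguments.
if [ "$#" -ge 3 ]; then
  main "$@"
fi
```

It would be wired in with a single line such as `notify "/etc/keepalived/scripts/notify.sh"` inside the `vrrp_instance` block, where the script path is an assumption for this example.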

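Failure-Handling Workflow (Actionable Steps)

  1. Detection: chk_nginx fails several times in a row (`fall`, 3 in the example), which triggers FAULT or a demotion.
  2. Quick triage: the notify script reads $1/$2 to identify the instance and target state, then performs idempotent actions.
  3. Resource takeover: the MASTER node binds the VIP and starts the service.
  4. Service verification: check inside the script with systemctl is-active and curl.
  5. Alerting: send through logger, an HTTP webhook, or email.
  6. Incident record: write to /var/log/keepalived/notify.log.
  7. Failback policy: set nopreempt or preempt_delay to limit flapping.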
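Step 4 is where a notify script most easily becomes slow, because a service may need a moment before it answers. The sketch below shows one bounded way to verify it: retry a check command a fixed number of times with a short delay. The helper `retry_check` is hypothetical and takes the attempt count, the delay in seconds, and then the command to run.

```shell
#!/bin/bash
# Sketch: run a check command up to N times, sleeping between attempts.
# Usage: retry_check <attempts> <delay_seconds> <command> [args...]
retry_check() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0            # check passed
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"      # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                # every attempt failed
}
```

For example, `retry_check 3 1 systemctl is-active --quiet nginx` waits at most about two seconds in total, which keeps the script within the "fast" requirement while tolerating a slow start.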

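Troubleshooting and Verification (with Explicit Commands)

# Check whether keepalived is running
systemctl status keepalived

# Follow the VRRP and script-execution logs
journalctl -u keepalived -f

# Verify that the VIP is bound
ip addr show dev eth0 | grep -F '10.0.0.100'

# Simulate a failure: stop nginx on the MASTER to trigger demotion
systemctl stop nginx

# Verify that notify.log recorded the transition
tail -f /var/log/keepalived/notify.log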
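When reading notify.log after a test, a count of transitions per state makes flapping easier to spot than scrolling through raw lines. The sketch below assumes the log format written by the notify scripts in this section (`date time instance state begin|done`). The helper name `count_transitions` is hypothetical.

```shell
#!/bin/bash
# Sketch: count state transitions in notify.log (read from stdin).
# Only "begin" lines are counted so each transition is counted once.
count_transitions() {
  awk '$5 == "begin" { n[$4]++ }
       END { for (s in n) print s, n[s] }' | sort
}
```

It would be run as `count_transitions < /var/log/keepalived/notify.log`. A MASTER count that keeps climbing on both nodes suggests the pair is flapping.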

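Common problems and fixes:
- Script does not run: check enable_script_security and the script's ownership and permissions; confirm that the script path is correct.
- Service not started after a switchover: check the systemctl result and SELinux/firewall settings.
- Frequent flapping: increase preempt_delay, raise the fall count, and avoid time-consuming operations in the scripts.
- VIP binding fails: check that the interface name and the VIP/netmask settings match.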
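For the "script does not run" case, note that with enable_script_security Keepalived refuses to run a root script if any part of its path is writable by a non-root user. The sketch below tests one piece of that: whether an octal mode has the group-write or other-write bit set. The helper `mode_is_safe` is hypothetical, and it is stricter than Keepalived in that it also flags group-write when the group is root.

```shell
#!/bin/bash
# Sketch: succeed if an octal permission mode has neither the
# group-write nor the other-write bit set (mask 022).
mode_is_safe() {
  # The leading 0 makes the shell read the value as octal.
  [ $(( 0$1 & 022 )) -eq 0 ]
}
```

With GNU coreutils, a path component can be checked as `mode_is_safe "$(stat -c %a /etc/keepalived/scripts)"`, repeated for the script itself and for each parent directory.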

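Exercises

  1. Add a curl http://127.0.0.1:80/health check to notify_master.sh, and write a log entry when it fails.
  2. Make notify_fault.sh send an HTTP alert (to a self-hosted webhook endpoint).
  3. Simulate VIP failover with ip addr del/add and check whether notify.log records every transition.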