13.7.4 HAProxy Linkage Scripts and Failure Detection Strategy

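This section covers the scripts that link HAProxy with Keepalived and the failure detection strategy behind them. It provides runnable scripts, configuration, troubleshooting commands, and exercises so that VIP failover stays controllable, observable, and traceable.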

1. Principle and linkage flow sketch

[Figure: sketch of the HAProxy and Keepalived linkage flow]

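2. Environment preparation and installation (two nodes)

# CentOS/RHEL (socat is needed by the check script to query the stats socket)
yum -y install haproxy keepalived socat

# Ubuntu/Debian
apt -y install haproxy keepalived socat

# Verify the services
systemctl status haproxy keepalived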

3. HAProxy stats socket and minimal configuration
- Purpose: give the check script local health data (avoiding the errors of remote probing).
- File: /etc/haproxy/haproxy.cfg

global
  log /dev/log local0
  maxconn 20000
  stats socket /run/haproxy/admin.sock mode 660 level admin

defaults
  mode http
  timeout connect 5s
  timeout client  30s
  timeout server  30s

frontend fe_http
  bind *:80
  default_backend be_app

backend be_app
  balance roundrobin
  option httpchk GET /health
  server s1 10.0.0.11:8080 check
  server s2 10.0.0.12:8080 check

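Notes:
- stats socket /run/haproxy/admin.sock: enables a local UNIX socket from which the script reads backend state (the /run/haproxy directory must exist before HAProxy starts).
- option httpchk: enables HTTP health checks, which determine whether each backend server is UP or DOWN.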
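
To make the stats data concrete, here is a minimal sketch of how the CSV returned by "show stat" could be reduced to a count of healthy servers. It assumes the standard CSV layout (field 1 = proxy name, field 2 = server name, field 18 = status); the function names and the abbreviated sample rows are hypothetical and only there so the example runs without a live HAProxy.

```shell
# Count servers of one backend whose status starts with "UP".
# Reads "show stat" CSV on stdin. The FRONTEND/BACKEND summary rows are
# skipped so that only real servers are counted.
count_up_servers() {
  awk -F, -v be="$1" '
    $1 == be && $2 != "BACKEND" && $2 != "FRONTEND" && $18 ~ /^UP/ { c++ }
    END { print c + 0 }'
}

# Hypothetical, abbreviated sample (real output has many more columns).
sample_stat() {
  cat <<'EOF'
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight
fe_http,FRONTEND,,,0,1,20000,5,0,0,0,0,0,,,,,OPEN,
be_app,s1,0,0,0,1,,3,0,0,,0,,0,0,0,0,UP,1
be_app,s2,0,0,0,1,,2,0,0,,0,,0,0,0,0,DOWN,1
be_app,BACKEND,0,0,0,1,2000,5,0,0,0,0,,0,0,0,0,UP,1
EOF
}

sample_stat | count_up_servers be_app
```

Against a live socket the same function would be fed with: echo "show stat" | socat stdio /run/haproxy/admin.sock | count_up_servers be_app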

4. Keepalived check script (thresholds, failure counting, persisted state)
- Script path: /etc/keepalived/check_haproxy.sh

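#!/usr/bin/env bash
set -euo pipefail

SOCK="/run/haproxy/admin.sock"
STATE_FILE="/var/run/ha_check.state"
MIN_UP=1     # minimum number of backend servers that must be UP
MAX_FAIL=3   # consecutive failed runs before the script returns non-zero
CPU_MAX=90   # CPU usage threshold (%)
MEM_MAX=90   # memory usage threshold (%)

log() { logger -t keepalived-check "$1"; }

# Load the consecutive-failure count left by the previous run (default 0)
fail_cnt=0
[[ -f "$STATE_FILE" ]] && fail_cnt=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
[[ "$fail_cnt" =~ ^[0-9]+$ ]] || fail_cnt=0

failed=0   # becomes 1 if any check fails in this run

# Print "busy total" CPU jiffies (user+nice+system, and the same plus idle)
cpu_sample() {
  local label user nice system idle rest
  read -r label user nice system idle rest < /proc/stat
  echo "$((user + nice + system)) $((user + nice + system + idle))"
}

# 1) Process check
if ! pidof haproxy >/dev/null 2>&1; then
  log "haproxy process not found"
  failed=1
else
  # 2) Listening port; capture the output first so that grep -q closing the
  #    pipe early cannot make the pipeline fail under pipefail
  listeners=$(ss -lntp 2>/dev/null || true)
  if ! grep -q ":80 .*haproxy" <<<"$listeners"; then
    log "haproxy port 80 not listening"
    failed=1
  else
    # 3) Number of UP servers from the stats socket (CSV field 18 = status);
    #    the aggregate BACKEND row is skipped so only real servers are counted
    if [[ -S "$SOCK" ]]; then
      up_cnt=$(echo "show stat" | socat stdio "$SOCK" 2>/dev/null \
        | awk -F, '$1 ~ /^be_/ && $2 != "BACKEND" && $18 ~ /^UP/ {c++} END{print c+0}') || up_cnt=0
      if [[ "$up_cnt" -lt "$MIN_UP" ]]; then
        log "backend up count too low: $up_cnt"
        failed=1
      fi
    else
      log "stats socket missing: $SOCK"
      failed=1
    fi

    # 4) Resource thresholds; CPU is measured over a short window because
    #    the /proc/stat counters are cumulative since boot
    read -r busy1 total1 < <(cpu_sample)
    sleep 0.2
    read -r busy2 total2 < <(cpu_sample)
    cpu=0
    if (( total2 > total1 )); then
      cpu=$(( (busy2 - busy1) * 100 / (total2 - total1) ))
    fi
    mem=$(awk '/MemTotal/{t=$2}/MemAvailable/{a=$2} END{printf("%d", (t-a)*100/t)}' /proc/meminfo)
    if [[ "$cpu" -ge "$CPU_MAX" || "$mem" -ge "$MEM_MAX" ]]; then
      log "resource high: cpu=${cpu} mem=${mem}"
      failed=1
    fi
  fi
fi

# Consecutive-failure threshold: count at most one failure per run and
# reset the counter only after a fully successful run
if (( failed )); then
  fail_cnt=$((fail_cnt + 1))
else
  fail_cnt=0
fi
echo "$fail_cnt" > "$STATE_FILE"

if [[ "$fail_cnt" -ge "$MAX_FAIL" ]]; then
  log "check FAIL, fail_cnt=$fail_cnt"
  exit 1
fi
(( failed )) && log "check degraded, fail_cnt=$fail_cnt"
exit 0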

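Script notes:
- pidof haproxy: checks that the HAProxy process is alive.
- ss -lntp: checks that the frontend port is listening (the -p process column requires root).
- socat stdio /run/haproxy/admin.sock: reads HAProxy stats from the local socket.
- /proc/stat, /proc/meminfo: sources for the CPU and memory usage that are compared against the thresholds.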
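
The consecutive-failure idea can be looked at in isolation. The sketch below is a simplified model, not the check script itself: it assumes each run contributes at most one failure and that any successful run resets the counter. The function names and the replayed sequence of results are hypothetical.

```shell
# next_fail_count PREV RESULT
# RESULT is 0 for a passed run and non-zero for a failed run.
next_fail_count() {
  if [ "$2" -eq 0 ]; then
    echo 0
  else
    echo $(( $1 + 1 ))
  fi
}

# simulate MAX_FAIL RESULT...
# Prints one character per run: "F" when the threshold is reached, "." otherwise.
simulate() {
  max_fail="$1"; shift
  cnt=0
  out=""
  for r in "$@"; do
    cnt=$(next_fail_count "$cnt" "$r")
    if [ "$cnt" -ge "$max_fail" ]; then out="${out}F"; else out="${out}."; fi
  done
  echo "$out"
}

# Two failures, a success, three failures, a success
simulate 3 1 1 0 1 1 1 0
```

In this replay only the sixth run reaches the threshold, because the success in the third run clears the two earlier failures.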

5. Keepalived configuration
- File: /etc/keepalived/keepalived.conf

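vrrp_script chk_haproxy {
  script "/etc/keepalived/check_haproxy.sh"
  interval 2
  timeout 1
  fall 3
  rise 2
  # weight is left at its default (0): a failed check moves the instance to
  # FAULT and releases the VIP. With nopreempt, a priority drop alone
  # (e.g. weight -20) does not make the peer take over.
}

vrrp_instance VI_1 {
  state BACKUP          # nopreempt requires the initial state BACKUP on both nodes
  interface eth0
  virtual_router_id 51
  priority 150          # use a lower value on the peer node
  advert_int 1
  nopreempt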

  authentication {
    auth_type PASS
    auth_pass 123456
  }

  track_script {
    chk_haproxy
  }

  virtual_ipaddress {
    10.0.0.100/24 dev eth0
  }

  notify_master "/usr/local/bin/notify.sh master"
  notify_backup "/usr/local/bin/notify.sh backup"
  notify_fault  "/usr/local/bin/notify.sh fault"
}
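
The timing parameters above translate into a rough failover delay. The sketch below is only an estimate: it assumes the check script starts returning non-zero on its MAX_FAIL-th consecutive failed run, that Keepalived then needs fall consecutive non-zero results, and it ignores VRRP advertisement timing. The function name is hypothetical.

```shell
# failover_delay INTERVAL FALL MAX_FAIL -> approximate seconds until the
# tracked script is considered failed by Keepalived
failover_delay() {
  # Runs 1..(MAX_FAIL-1) still exit 0; after that, FALL consecutive
  # non-zero exits are required.
  echo $(( $1 * ($3 - 1 + $2) ))
}

failover_delay 2 3 3   # interval 2, fall 3, in-script threshold 3
failover_delay 2 3 1   # same, with no in-script threshold
```

With the values used in this section the two thresholds add up, so a shorter delay means lowering either fall or the in-script threshold rather than both being tuned independently.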

6. Notification script (alerting and audit)
- File: /usr/local/bin/notify.sh

#!/usr/bin/env bash
state="${1:-unknown}"
logger -t keepalived-notify "state changed to $state on $(hostname)"
# Possible extension: call an alerting API with curl, or send an email
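
One way the curl extension could look is sketched here. The payload fields, the function name, and the ALERT_URL variable are all hypothetical; the sketch also assumes that the state and host name contain no characters that need JSON escaping. The curl call is left commented out so the example runs without a real endpoint.

```shell
# Build a small JSON body describing the state change.
build_alert_payload() {
  printf '{"source":"keepalived","host":"%s","state":"%s"}\n' "$2" "$1"
}

build_alert_payload master lb01

# Inside notify.sh the payload could be posted like this (ALERT_URL is a
# placeholder for your own alerting endpoint):
# curl -fsS -m 3 -H 'Content-Type: application/json' \
#   -d "$(build_alert_payload "$state" "$(hostname)")" "$ALERT_URL" || true
```

The trailing "|| true" in the commented call keeps a failed alert from affecting the exit status of the notify script.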

7. Start and verify

chmod +x /etc/keepalived/check_haproxy.sh /usr/local/bin/notify.sh
systemctl enable --now haproxy keepalived

# Check which node currently holds the VIP (run on both nodes)
ip addr show eth0 | grep 10.0.0.100

# Run the check manually
/etc/keepalived/check_haproxy.sh; echo $?

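8. Failure drill and expected result

# Simulate an HAProxy outage
systemctl stop haproxy

# Watch the Keepalived log
journalctl -u keepalived -f

# Expected result: the script returns non-zero and the VIP moves to the backup node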
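
During a drill it can help to wait for the VIP on the backup node instead of polling by hand. The following is a minimal sketch; wait_until and has_vip are hypothetical helper names, and the 15-second timeout in the usage line is an arbitrary choice.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds or the timeout expires.
wait_until() {
  timeout="$1"; shift
  waited=0
  until "$@"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# has_vip INTERFACE ADDRESS: succeeds if the address is configured on the interface
has_vip() {
  ip -o addr show dev "$1" | grep -qF " $2/"
}

# Usage on the backup node (not executed here):
# wait_until 15 has_vip eth0 10.0.0.100 && echo "VIP acquired"
```

Because wait_until takes the probe as an argument, the same helper can also wait for a listening port or a log line.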

9. Common failures and troubleshooting commands
- VIP does not move

# Check Keepalived status and the script results
systemctl status keepalived
journalctl -u keepalived --no-pager | tail -n 50

- Stats socket cannot be read

ls -l /run/haproxy/admin.sock
echo "show stat" | socat stdio /run/haproxy/admin.sock | head

- Port check gives a false result

ss -lntp | grep haproxy

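10. Exercises
1) Change MIN_UP from 1 to 2 and observe the failover behavior when only one backend server remains.
2) Set interval to 1 and timeout to 1, observe the risk of flapping, and locate the cause in the logs.
3) Comment out nopreempt, check whether failback happens frequently, and record the number of switches.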
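
For the third exercise, the switches can be counted from the lines that notify.sh writes through logger. This is a minimal sketch; the function names are hypothetical and the sample log lines are invented for illustration, following the "state changed to STATE on HOST" message format used by the notification script.

```shell
# count_transitions STATE: count notify lines for one state, read from stdin
count_transitions() {
  grep -c "keepalived-notify.*state changed to $1 " || true
}

# Hypothetical sample of syslog lines produced by notify.sh
sample_log() {
  cat <<'EOF'
Jan 10 10:00:01 lb01 keepalived-notify: state changed to master on lb01
Jan 10 10:05:12 lb01 keepalived-notify: state changed to fault on lb01
Jan 10 10:05:40 lb01 keepalived-notify: state changed to backup on lb01
Jan 10 10:09:03 lb01 keepalived-notify: state changed to master on lb01
EOF
}

sample_log | count_transitions master
```

On a real node the input could come from: journalctl -t keepalived-notify --no-pager | count_transitions master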