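13.7.4 HAProxy Integration Scripts and Failure Detection Strategy
This section covers the scripts that tie HAProxy to Keepalived and the strategy used to decide when a node has failed. It provides runnable scripts, configuration, troubleshooting steps, and exercises, so that VIP failover stays controllable, observable, and traceable.
1. Principles and integration flow sketch
2. Environment preparation and installation example (two nodes)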
# CentOS/RHEL (socat is used by the check script to query the stats socket)
yum -y install haproxy keepalived socat
# Ubuntu/Debian
apt -y install haproxy keepalived socat
# Verify the services
systemctl status haproxy keepalived
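3. HAProxy stats socket and minimal configuration example
- Purpose: give the check script local health data, avoiding the errors that remote probing introduces.
- File path: /etc/haproxy/haproxy.cfg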
global
    log /dev/log local0
    maxconn 20000
    stats socket /run/haproxy/admin.sock mode 660 level admin
defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
frontend fe_http
    bind *:80
    default_backend be_app
backend be_app
    balance roundrobin
    option httpchk GET /health
    server s1 10.0.0.11:8080 check
    server s2 10.0.0.12:8080 check
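Directive notes:
- stats socket /run/haproxy/admin.sock: enables a local UNIX socket from which the script reads backend state.
- option httpchk: enables HTTP health checks, which determine the UP/DOWN status of each backend server.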
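As a rough illustration of how a script can read backend state from the socket, the sketch below counts the real servers of one backend that are UP. It assumes the standard `show stat` CSV layout (field 1 = proxy name, field 2 = server name, field 18 = status); the function name `count_up_servers` is hypothetical. The FRONTEND and BACKEND summary rows are skipped so that only real servers are counted, and a status such as "UP 1/3" (a server that is still UP but currently failing checks) is counted as UP.

```shell
#!/usr/bin/env bash
# Count real servers in state UP for one backend.
# stdin: CSV produced by "show stat"; $1: backend name (e.g. be_app).
count_up_servers() {
  local backend="$1"
  awk -F, -v be="$backend" '
    # Skip the per-proxy summary rows; only real servers are counted.
    $1 == be && $2 != "FRONTEND" && $2 != "BACKEND" && $18 ~ /^UP/ { c++ }
    END { print c + 0 }
  '
}
```

A typical call, assuming socat is installed and the socket path from the configuration above: `echo "show stat" | socat stdio /run/haproxy/admin.sock | count_up_servers be_app`.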
4. Keepalived check script example (thresholds, failure counting, state persisted to disk)
- Script path: /etc/keepalived/check_haproxy.sh
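#!/usr/bin/env bash
set -euo pipefail
SOCK="/run/haproxy/admin.sock"
STATE_FILE="/var/run/ha_check.state"
MIN_UP=1      # minimum number of real servers that must be UP
MAX_FAIL=3    # consecutive failing runs before the script reports failure
CPU_MAX=90    # percent
MEM_MAX=90    # percent

log() { logger -t keepalived-check "$1"; }

# Read the consecutive-failure count left by the previous run
fail_cnt=0
[[ -f "$STATE_FILE" ]] && fail_cnt=$(cat "$STATE_FILE" || echo 0)
[[ "$fail_cnt" =~ ^[0-9]+$ ]] || fail_cnt=0

failed=0   # set to 1 if any check fails in this run

# 1) Process check
if ! pidof haproxy >/dev/null 2>&1; then
  log "haproxy process not found"
  failed=1
# 2) Listening port (ss -p needs root to show the owning process;
#    grep reads all input so pipefail cannot trip on SIGPIPE)
elif ! ss -lntp | grep ":80 .*haproxy" >/dev/null; then
  log "haproxy port 80 not listening"
  failed=1
else
  # 3) Number of UP servers, read from the stats socket
  if [[ -S "$SOCK" ]]; then
    # Field 18 is the status column; skip the FRONTEND/BACKEND summary rows
    up_cnt=$(echo "show stat" | socat stdio "$SOCK" \
      | awk -F, '$1 ~ /^be_/ && $2 != "FRONTEND" && $2 != "BACKEND" && $18 == "UP" {c++} END {print c+0}')
    if [[ "$up_cnt" -lt "$MIN_UP" ]]; then
      log "backend up count too low: $up_cnt"
      failed=1
    fi
  else
    log "stats socket missing: $SOCK"
    failed=1
  fi
  # 4) Resource thresholds
  # Note: /proc/stat counters are cumulative, so cpu is the average since boot
  cpu=$(awk '/^cpu /{u=$2+$4; t=$2+$4+$5; printf("%d", u*100/t)}' /proc/stat)
  mem=$(awk '/MemTotal/{t=$2}/MemAvailable/{a=$2} END{printf("%d", (t-a)*100/t)}' /proc/meminfo)
  if [[ "$cpu" -ge "$CPU_MAX" || "$mem" -ge "$MEM_MAX" ]]; then
    log "resource high: cpu=${cpu} mem=${mem}"
    failed=1
  fi
fi

# Consecutive-failure threshold: reset on success, accumulate on failure
if [[ "$failed" -eq 0 ]]; then
  echo 0 > "$STATE_FILE"
  log "check OK"
  exit 0
fi
fail_cnt=$((fail_cnt+1))
echo "$fail_cnt" > "$STATE_FILE"
if [[ "$fail_cnt" -ge "$MAX_FAIL" ]]; then
  log "check FAIL, fail_cnt=$fail_cnt"
  exit 1
fi
log "check failed but tolerated, fail_cnt=$fail_cnt/$MAX_FAIL"
exit 0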
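Script command notes:
- pidof haproxy: checks that the HAProxy process is alive.
- ss -lntp: checks that the frontend port is listening.
- socat stdio /run/haproxy/admin.sock: reads HAProxy stats from the socket.
- /proc/stat, /proc/meminfo: source data for the CPU and memory usage thresholds.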
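Because the counters in /proc/stat are cumulative since boot, a single read yields a long-term average rather than the current load. One possible refinement, sketched below under the assumption that the check can afford a short sampling pause, is to compute usage from the difference between two samples. The function name `cpu_busy_pct` is hypothetical; busy time is taken as everything except idle and iowait.

```shell
#!/usr/bin/env bash
# Percentage of non-idle CPU time between two samples.
# $1, $2: aggregate "cpu ..." lines from /proc/stat, older sample first.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user .. steal
      idle = $5 + $6                                      # idle + iowait
      if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle }
    }
    END {
      dt = t1 - t0
      if (dt <= 0) { print 0; exit }
      printf "%d\n", (dt - (i1 - i0)) * 100 / dt
    }'
}

# Example use inside a check script (the pause must fit within keepalived's timeout):
#   a=$(grep '^cpu ' /proc/stat); sleep 0.3; b=$(grep '^cpu ' /proc/stat)
#   cpu=$(cpu_busy_pct "$a" "$b")
```

The trade-off is that the pause lengthens every run of the check, so the sampling window has to stay well below the configured script timeout.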
5. Keepalived configuration example
- File path: /etc/keepalived/keepalived.conf
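vrrp_script chk_haproxy {
    script "/etc/keepalived/check_haproxy.sh"
    interval 2
    timeout 1
    fall 3
    rise 2
    # No weight is set: a failed check moves the instance to FAULT and releases the VIP.
    # With nopreempt, merely lowering the priority (weight -20) would not make the peer take over.
}
vrrp_instance VI_1 {
    state BACKUP        # nopreempt only takes effect when the initial state is BACKUP
    interface eth0
    virtual_router_id 51
    priority 150        # use a lower value on the peer node
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass 123456
    }
    track_script {
        chk_haproxy
    }
    virtual_ipaddress {
        10.0.0.100/24 dev eth0
    }
    notify_master "/usr/local/bin/notify.sh master"
    notify_backup "/usr/local/bin/notify.sh backup"
    notify_fault "/usr/local/bin/notify.sh fault"
}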
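The timing parameters in this configuration (interval, fall, advert_int) combine with the script's own MAX_FAIL counter to set how long a failover takes. The sketch below gives a rough upper bound, not a measured value. It assumes VRRPv2 timers (Master_Down_Interval = 3 x advert_int + Skew_Time, with Skew_Time = (256 - priority) / 256 seconds), a script that starts returning non-zero only on its MAX_FAIL-th consecutive failing run, and a backup priority of 100, which is a made-up value since the original does not state one. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Rough upper bound, in seconds, from the moment a fault occurs
# until the backup node takes over the VIP.
# Args: interval fall max_fail advert_int backup_priority
estimate_failover_seconds() {
  awk -v i="$1" -v f="$2" -v m="$3" -v a="$4" -v p="$5" 'BEGIN {
    detect = (m - 1 + f) * i       # runs tolerated by the script, then keepalived "fall"
    skew = (256 - p) / 256         # VRRPv2 Skew_Time of the backup
    master_down = 3 * a + skew     # VRRPv2 Master_Down_Interval
    printf "%.2f\n", detect + master_down
  }'
}

estimate_failover_seconds 2 3 3 1 100   # prints 13.61
```

Most of the delay here comes from debouncing twice, once in the script (MAX_FAIL) and once in keepalived (fall). Lowering either value shortens detection at the cost of more sensitivity to transient errors.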
6. Notification script example (alerting and auditing)
- File path: /usr/local/bin/notify.sh
#!/usr/bin/env bash
state="$1"
logger -t keepalived-notify "state changed to $state on $(hostname)"
# Possible extension: call an alerting API with curl, or send an email
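One way the alerting extension might look is sketched below. The endpoint in ALERT_URL and the JSON field names are hypothetical and would need to match whatever alerting system is actually in use. The payload builder does no JSON escaping, so it assumes the host name and state contain no quotes or backslashes.

```shell
#!/usr/bin/env bash
# Build a small JSON payload describing a state change.
# Args: state host timestamp
build_alert_payload() {
  printf '{"source":"keepalived","host":"%s","state":"%s","time":"%s"}\n' "$2" "$1" "$3"
}

# Post the payload to ALERT_URL (hypothetical endpoint); do nothing if it is unset.
send_alert() {
  local url="${ALERT_URL:-}"
  [ -n "$url" ] || return 0
  build_alert_payload "$1" "$(hostname)" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    | curl -fsS -m 3 -H 'Content-Type: application/json' --data-binary @- "$url" >/dev/null \
    || logger -t keepalived-notify "alert delivery failed for state $1"
}
```

The 3-second curl timeout keeps a slow or unreachable alert endpoint from blocking the notify script for long, and a delivery failure is only logged, so it never affects the VRRP state change itself.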
7. Startup and verification commands
chmod +x /etc/keepalived/check_haproxy.sh /usr/local/bin/notify.sh
systemctl enable --now haproxy keepalived
# Check which node currently holds the VIP
ip addr show eth0 | grep 10.0.0.100
# Run the check manually and print its exit code
/etc/keepalived/check_haproxy.sh; echo $?
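8. Failure drill and expected result
# Simulate HAProxy stopping
systemctl stop haproxy
# Watch the Keepalived log
journalctl -u keepalived -f
# Expected result: once the failure threshold is reached, the script returns non-zero and the VIP moves to the backup node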
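To make the drill measurable, the sketch below timestamps each moment the VIP appears on or disappears from a node. It is meant to run on both nodes during the drill; the function names are hypothetical, and the VIP and interface are the ones used in this section. The address is matched exactly against the fourth field of `ip -o -4 addr show` output rather than with a loose grep.

```shell
#!/usr/bin/env bash
# Succeeds if the given address is present.
# stdin: output of "ip -o -4 addr show"; $1: VIP without prefix length.
has_vip() {
  awk -v vip="$1" '{ split($4, a, "/"); if (a[1] == vip) found = 1 } END { exit !found }'
}

# Print a timestamped line whenever the VIP appears or disappears (Ctrl-C to stop).
# Args: vip interface
watch_vip() {
  local vip="$1" dev="$2" prev="" cur
  while :; do
    if ip -o -4 addr show dev "$dev" | has_vip "$vip"; then cur=present; else cur=absent; fi
    if [ "$cur" != "$prev" ]; then
      printf '%s VIP %s %s\n' "$(date +%T.%N)" "$vip" "$cur"
      prev="$cur"
    fi
    sleep 0.2
  done
}

# Example: watch_vip 10.0.0.100 eth0
```

Comparing the "absent" timestamp on the old master with the "present" timestamp on the new one gives an approximate failover duration, provided the clocks of both nodes are synchronized.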
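9. Common faults and troubleshooting commands
- VIP does not move
# Check keepalived status and the results of the check script
systemctl status keepalived
journalctl -u keepalived --no-pager | tail -n 50
- Stats socket read fails
ls -l /run/haproxy/admin.sock
echo "show stat" | socat stdio /run/haproxy/admin.sock | head
- Port check gives a false result
# Run as root: without it, ss -p omits process names and the grep for haproxy never matches
ss -lntp | grep haproxy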
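If the port check keeps misfiring because process names are unavailable, one alternative is to match only on the listening port, as in the sketch below. The function name is hypothetical. It parses the local address column of `ss -Hlnt` (no header, no process information) and compares the text after the last colon, so port 80 does not match 8080 and IPv6 listeners such as [::]:80 are handled. It does not prove that the listener is HAProxy, so it complements the process check rather than replacing it.

```shell
#!/usr/bin/env bash
# Succeeds if some TCP socket is listening on the given port.
# stdin: output of "ss -Hlnt"; $1: port number.
port_is_listening() {
  awk -v port="$1" '
    { n = split($4, a, ":"); if (a[n] == port) found = 1 }
    END { exit !found }
  '
}

# Example: ss -Hlnt | port_is_listening 80 && echo "port 80 is listening"
```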
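10. Exercises
1) Change MIN_UP from 1 to 2 and observe the failover behavior when only one backend server remains.
2) Change interval to 1 and timeout to 1, observe the risk of flapping, and locate the cause in the logs.
3) Comment out nopreempt, check whether fail-back happens frequently, and record the number of switchovers.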
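For exercise 3, the switchover count can be taken from the lines written by the notification script. The sketch below assumes the message format used by notify.sh in this section ("state changed to <state> on <host>"); the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Count state transitions per state.
# stdin: log lines; output: "<state> <count>" pairs, sorted by state name.
count_transitions() {
  awk '
    match($0, /state changed to [a-z]+/) {
      # Skip the 17-character prefix "state changed to " to get the state name
      s = substr($0, RSTART + 17, RLENGTH - 17)
      n[s]++
    }
    END { for (s in n) printf "%s %d\n", s, n[s] }
  ' | sort
}

# Example: journalctl -t keepalived-notify --no-pager | count_transitions
```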