13.7.7 Common Problems and Troubleshooting Approach

Schematic: failure flow in an HAProxy + Keepalived high-availability setup

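0. Base Installation and Log Preparation (Troubleshooting Prerequisites)

Goal: make sure the components are installed, logs are available, and the key kernel parameters are in place, so that troubleshooting has an evidence trail.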

# Install (CentOS/RHEL)
yum install -y haproxy keepalived rsyslog

# Start now and enable at boot
systemctl enable --now haproxy keepalived rsyslog

# Check logs (Keepalived)
journalctl -u keepalived -n 100 --no-pager

# Check logs (HAProxy)
journalctl -u haproxy -n 100 --no-pager

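Key kernel parameters (VIP failover / ARP)

# /etc/sysctl.d/99-keepalived.conf
# Allow HAProxy to bind the VIP while this node does not hold it
net.ipv4.ip_nonlocal_bind = 1
# Answer ARP only for addresses configured on the receiving interface
net.ipv4.conf.all.arp_ignore = 1
# Use the best local source address in ARP announcements
net.ipv4.conf.all.arp_announce = 2

# Apply the settings
sysctl -p /etc/sysctl.d/99-keepalived.conf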
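To confirm that the settings actually took effect, the file can be compared against the live kernel values. The following is a minimal sketch rather than part of the original procedure: `sysctl_drift` and the `PROC_SYS_ROOT` override are hypothetical names introduced here, and the function assumes plain `key = value` lines.

```shell
#!/bin/sh
# Compare "key = value" lines in a sysctl conf file with the live values.
# PROC_SYS_ROOT is a hypothetical override (defaults to /proc/sys) so the
# function can be exercised against a scratch directory.
sysctl_drift() {
    awk -F= -v root="${PROC_SYS_ROOT:-/proc/sys}" '
        { sub(/#.*/, ""); gsub(/[ \t]/, "") }    # drop comments and whitespace
        $0 == "" { next }
        {
            rel = $1; gsub(/\./, "/", rel)        # net.ipv4.x -> net/ipv4/x
            path = root "/" rel
            if ((getline have < path) <= 0) have = "missing"
            close(path)
            if (have != $2) {
                printf "DRIFT %s: want=%s have=%s\n", $1, $2, have
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}
```

Running `sysctl_drift /etc/sysctl.d/99-keepalived.conf` prints nothing and returns 0 when every value matches; otherwise it prints one `DRIFT` line per differing key and returns 1.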

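1. VIP Does Not Fail Over, or Failover Fails

Symptom: after the master goes down, the VIP stays on the original host, or the new host does not bind the VIP.
Troubleshooting steps and examples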

# 1) Check Keepalived status and logs
systemctl status keepalived
journalctl -u keepalived -n 50 --no-pager

# 2) Verify whether the VIP is bound (matches the "eth0:vip" label)
ip a | grep -A2 "vip"

# 3) Check whether VRRP (IP protocol 112) is blocked by the firewall
iptables -L -n | grep -Ei "vrrp|112"
firewall-cmd --list-all | grep -i vrrp

VRRP configuration check example (both nodes must agree)

# /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    state MASTER               # BACKUP on the backup node
    interface eth0
    virtual_router_id 51       # must match on both nodes
    priority 150               # 100 on the backup node
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111         # must match on both nodes
    }
    virtual_ipaddress {
        192.168.10.100/24 dev eth0 label eth0:vip
    }
}
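The "must match" fields can be compared mechanically once both configuration files are on one machine. This is a rough sketch under assumptions: `vrrp_field` and `vrrp_mismatch` are hypothetical helper names, each keyword is assumed to appear once per file, and the output includes `auth_pass` values, so it should not be pasted into shared logs.

```shell
#!/bin/sh
# Print the value of a single-valued keepalived.conf keyword (first occurrence).
vrrp_field() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Report which of the keywords that must match differ between two config files.
# Runs in a subshell so its variables do not leak into the caller.
vrrp_mismatch() (
    rc=0
    for key in virtual_router_id advert_int auth_type auth_pass; do
        a="$(vrrp_field "$1" "$key")"
        b="$(vrrp_field "$2" "$key")"
        if [ "$a" != "$b" ]; then
            echo "MISMATCH $key: $a vs $b"
            rc=1
        fi
    done
    exit "$rc"
)
```

For example, after copying the peer's file to a hypothetical path such as `/tmp/peer.conf`, `vrrp_mismatch /etc/keepalived/keepalived.conf /tmp/peer.conf` returns non-zero and names each differing keyword.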

Expected result
- After Keepalived is stopped on the master, the backup binds the VIP within 1-3 seconds:

systemctl stop keepalived
ip a | grep -A2 "vip"
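The takeover delay can be estimated from the VRRP timers. The sketch below assumes the VRRPv3-style formulas, where skew time is (256 - priority) / 256 x advert_int and the master-down interval is 3 x advert_int plus the skew; with advert_int 1 the VRRPv2 result is the same. The function names are hypothetical.

```shell
#!/bin/sh
# Skew time in seconds: the delay a backup waits after the master announces
# a graceful shutdown (an advert with priority 0).
skew_time() {
    awk -v a="$1" -v p="$2" 'BEGIN { printf "%.3f\n", (256 - p) / 256 * a }'
}

# Master-down interval in seconds: how long a backup waits without hearing
# any advert (master crash or network cut) before taking over.
master_down_interval() {
    awk -v a="$1" -v p="$2" 'BEGIN { printf "%.3f\n", 3 * a + (256 - p) / 256 * a }'
}
```

With the values used in this section, `skew_time 1 100` gives 0.609 and `master_down_interval 1 100` gives 3.609, so a clean `systemctl stop keepalived` should hand over faster than a hard crash of the master.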

2. Dual Master (Split-Brain)

Symptom: both machines bind the VIP at the same time, and external access misbehaves.
Troubleshooting steps and examples

# On both nodes, check whether each one holds the VIP
ip a | grep -A2 "vip"

# Check whether the heartbeat network is unstable
ping -c 10 <peer_ip>
mtr -r -c 20 <peer_ip>
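Checking both nodes by hand is easy to get wrong under pressure, so the VIP test can be wrapped in a small helper. This is a sketch under assumptions: `has_vip` is a hypothetical name, and it expects the one-line-per-address format produced by `ip -o -4 addr show` on standard input.

```shell
#!/bin/sh
# Succeed if the given IPv4 address appears in `ip -o -4 addr show` output
# read from stdin. The match is exact, so 192.168.10.10 does not match .100.
has_vip() {
    awk -v vip="$1" '
        $3 == "inet" { split($4, a, "/"); if (a[1] == vip) found = 1 }
        END { exit !found }'
}

# Usage (lb1 and lb2 are hypothetical host names, SSH access is assumed):
#   for h in lb1 lb2; do
#       ssh "$h" ip -o -4 addr show | has_vip 192.168.10.100 && echo "$h holds the VIP"
#   done
```

If both hosts report that they hold the VIP, the pair is in split-brain.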

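Keepalived anti-flapping suggestions

vrrp_instance VI_1 {
    state BACKUP           # nopreempt only takes effect when both nodes start as BACKUP
    priority 150
    nopreempt              # a recovered higher-priority node does not take the VIP back
    advert_int 1
    unicast_peer {         # unicast adverts avoid depending on layer-2 multicast
        192.168.10.12      # the peer node's address
    }
}

Note: nopreempt stops a recovered higher-priority node from preempting the current master, which reduces VIP flapping when the network is unstable. It does not prevent split-brain by itself; that still requires VRRP adverts to reach the peer.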


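3. Frequent Failover Flapping

Symptom: master and backup switch roles frequently, with visible impact on the service.
Troubleshooting steps and examples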

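# Review Keepalived transition logs (look for frequent "Transition to MASTER" / "Entering MASTER STATE" entries)
journalctl -u keepalived --no-pager | grep -E "Transition|Entering"

# Check the health-check script's interval and thresholds
grep -n -A6 vrrp_script /etc/keepalived/keepalived.conf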
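To judge whether switching is "frequent", it helps to count MASTER entries per hour. A minimal sketch, assuming the log is read with `journalctl -o short-iso` (so the first field looks like `2024-05-01T10:15:30+0800`) and that the Keepalived version in use logs the phrase "Entering MASTER STATE"; the function name is hypothetical.

```shell
#!/bin/sh
# Count "Entering MASTER STATE" log lines per hour.
# Input: journalctl -o short-iso lines on stdin.
# Output: "<date>T<hour> <count>" lines, sorted by time.
master_flaps_per_hour() {
    awk '/Entering MASTER STATE/ { n[substr($1, 1, 13)]++ }
         END { for (h in n) print h, n[h] }' | sort
}

# Usage:
#   journalctl -u keepalived -o short-iso --no-pager | master_flaps_per_hour
```

More than a handful of entries in one hour on a node usually points at an unstable heartbeat path or an overly sensitive check script.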

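Script and threshold suggestions

vrrp_script chk_haproxy {
    script "/etc/keepalived/check_haproxy.sh"
    interval 3     # seconds between checks
    fall 3         # consecutive failures before the check counts as failed
    rise 2         # consecutive successes before it counts as recovered
    weight -30     # subtracted from priority while failed; must exceed the master/backup priority gap to cause failover
}

# /etc/keepalived/check_haproxy.sh
#!/bin/bash
# Exit code: 0 = healthy, non-zero = unhealthy (the script exits with pidof's status)
pidof haproxy >/dev/null 2>&1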

4. HAProxy Tracking Script Has No Effect

Symptom: HAProxy stops abnormally, but Keepalived does not trigger a failover.
Troubleshooting steps and examples

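# 1) Script permissions (the script must be executable)
ls -l /etc/keepalived/check_haproxy.sh

# 2) Run it manually and check the exit code
/etc/keepalived/check_haproxy.sh; echo $?

# 3) Check that track_script references the script
grep -n -A3 "track_script" /etc/keepalived/keepalived.conf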

Keepalived tracking configuration

vrrp_instance VI_1 {
    track_script {
        chk_haproxy
    }
}

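Note: the weight must lower the MASTER's effective priority below the BACKUP's priority, otherwise no failover happens. With the priorities from section 1 (150 and 100), weight -30 leaves the master at 120, which is still higher than 100; use a magnitude larger than the gap (for example -60) or narrow the gap between the two priorities.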
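The arithmetic behind the weight can be checked before touching the configuration. A small sketch, assuming the backup's own check passes (so its priority is unchanged) and that the weight is negative; `will_failover` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed when a failed check drops the master's effective priority below
# the backup's priority. Arguments: master priority, backup priority, weight.
# Runs in a subshell so its variables do not leak into the caller.
will_failover() (
    effective=$(( $1 + $3 ))
    echo "effective master priority: $effective (backup: $2)"
    [ "$effective" -lt "$2" ]
)
```

`will_failover 150 100 -30` reports an effective priority of 120 and fails, while `will_failover 150 100 -60` reports 90 and succeeds.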


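5. Client Access Interrupted or Connections Reset

Symptom: brief loss of access, or long-lived connections dropped, after a failover.
Troubleshooting steps and examples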

# Check the advert interval, which bounds the failover time
grep -n "advert_int" /etc/keepalived/keepalived.conf

# Send gratuitous ARP (GARP) to speed up ARP cache updates
arping -I eth0 -c 3 -A 192.168.10.100
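To put a number on "brief loss of access", the VIP can be probed once per second during a failover drill and the longest gap computed afterwards. This is a sketch under assumptions: the function names, the one-second probe period, and the default URL are illustrative choices, not part of the original setup.

```shell
#!/bin/sh
# Probe a URL once per second and print "<epoch> ok|fail" lines until interrupted.
# The default URL is a hypothetical example; point it at a real frontend.
probe_vip() {
    url="${1:-http://192.168.10.100/}"
    while :; do
        if curl -s -o /dev/null --max-time 1 "$url"; then s=ok; else s=fail; fi
        echo "$(date +%s) $s"
        sleep 1
    done
}

# Read "<epoch> ok|fail" lines and print the longest outage in seconds,
# measured from the first failed probe to the next successful one.
longest_outage() {
    awk '$2 == "fail" && !start { start = $1 }
         $2 == "ok" && start    { d = $1 - start; if (d > max) max = d; start = 0 }
         END { print max + 0 }'
}
```

One way to use it: run `probe_vip > /tmp/probe.log` from a client, trigger the failover, stop the probe with Ctrl-C, then run `longest_outage < /tmp/probe.log`.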

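HAProxy graceful-offline suggestions

# /etc/haproxy/haproxy.cfg
defaults
    timeout client  60s
    timeout server  60s
    timeout connect 5s

# For long-lived connections, tune these timeouts to reduce disconnects
# (connections that were open on the failed node still have to be re-established by the client)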

6. HAProxy Health Check False Positives

Symptom: a backend is healthy but is marked DOWN.
Troubleshooting steps and examples

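# Look for the reason health checks failed in the HAProxy logs
journalctl -u haproxy --no-pager | grep -iE "health|is DOWN" | tail -n 50

# Reproduce the health-check request (GET, matching option httpchk)
curl -i http://10.0.0.10:8080/health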

Health check configuration example

backend app_backend
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server app1 10.0.0.10:8080 check inter 2s rise 3 fall 2
    server app2 10.0.0.11:8080 check inter 2s rise 3 fall 2
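HAProxy's own view of each server, including the last check result, is available from the runtime `show stat` CSV. The sketch below assumes a stats socket has been enabled (the socket path in the usage comment is hypothetical) and locates columns by header name instead of position; `down_servers` is an illustrative name.

```shell
#!/bin/sh
# Read `show stat` CSV on stdin and print
# "<backend>/<server> <status> <check_status>" for every server not UP.
down_servers() {
    awk -F, '
        /^# / { sub(/^# /, ""); for (i = 1; i <= NF; i++) col[$i] = i; next }
        NF < 2 { next }                                   # skip blank lines
        $col["svname"] != "FRONTEND" && $col["svname"] != "BACKEND" &&
        $col["status"] !~ /^(UP|no check)/ {
            print $col["pxname"] "/" $col["svname"], $col["status"], $col["check_status"]
        }'
}

# Usage (assumes "stats socket /var/run/haproxy.sock" in the global section):
#   echo "show stat" | socat stdio /var/run/haproxy.sock | down_servers
```

A `check_status` such as `L7STS` (unexpected HTTP status) or `L4CON` (connection failure) narrows down whether the check path, the expected status, or plain connectivity is at fault.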

7. Incomplete or Missing Logs

Symptom: Keepalived or HAProxy logs are missing, which makes troubleshooting difficult.
Troubleshooting steps and examples

# rsyslog service
systemctl status rsyslog

# Check the HAProxy log targets
grep -n "log" /etc/haproxy/haproxy.cfg

HAProxy logging example

global
    log 127.0.0.1 local0
    maxconn 2000

defaults
    mode http
    option httplog
    log global

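rsyslog configuration example

# /etc/rsyslog.d/49-haproxy.conf
# HAProxy logs over UDP to 127.0.0.1, so rsyslog needs a UDP listener (skip the next two lines if imudp is already enabled)
module(load="imudp")
input(type="imudp" address="127.0.0.1" port="514")
local0.*    /var/log/haproxy.log
# Apply the change and watch the log
systemctl restart rsyslog haproxy
tail -f /var/log/haproxy.log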
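A quick static check can confirm that both halves of the HAProxy logging setup are present before looking at rsyslog. This is a sketch: `haproxy_log_configured` is a hypothetical helper, and it only looks for a `log` target in `global` and `log global` in `defaults`, not for per-frontend overrides.

```shell
#!/bin/sh
# Succeed if the config has a log target in "global" and "log global" in
# "defaults"; otherwise print what is missing and return non-zero.
haproxy_log_configured() {
    awk '
        $1 ~ /^(global|defaults|frontend|backend|listen)$/ { section = $1; next }
        section == "global"   && $1 == "log" && NF >= 3        { target = 1 }
        section == "defaults" && $1 == "log" && $2 == "global" { inherit = 1 }
        END {
            if (!target)  print "missing: log <address> <facility> in global"
            if (!inherit) print "missing: log global in defaults"
            exit !(target && inherit)
        }' "$1"
}
```

`haproxy_log_configured /etc/haproxy/haproxy.cfg` prints nothing when both directives are present.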

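Overall Troubleshooting Flow (Suggested Order)

  1. Configuration consistency: check that virtual_router_id, auth_pass, interface name, and VIP subnet match
  2. Network connectivity: check the heartbeat network and that VRRP (IP protocol 112) gets through
  3. Processes and scripts: Keepalived/HAProxy process status and script exit codes
  4. Logs and packet capture: journalctl, tcpdump -i eth0 vrrp
  5. Failover drill: stop the service on the master to verify failover and failback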
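For the packet-capture step, only the current master should be sending adverts for a given virtual router ID, so counting distinct senders is a quick split-brain test. The sketch below assumes tcpdump's usual one-line text format for VRRP (`... IP <src> > <dst>: VRRPv2, Advertisement, vrid <id>, ...`); the function name is hypothetical.

```shell
#!/bin/sh
# Read `tcpdump -n -i eth0 vrrp` text on stdin and print, per virtual router
# ID, how many distinct source addresses sent adverts and which ones.
vrrp_senders() {
    awk '/VRRPv/ {
             src = ""; vrid = ""
             for (i = 1; i < NF; i++) {
                 if ($i == "IP")   src = $(i + 1)
                 if ($i == "vrid") { vrid = $(i + 1); sub(/,$/, "", vrid) }
             }
             if (src != "" && vrid != "" && !seen[vrid, src]++) {
                 count[vrid]++
                 ips[vrid] = ips[vrid] " " src
             }
         }
         END { for (v in count) printf "vrid %s: %d sender(s):%s\n", v, count[v], ips[v] }'
}

# Usage:
#   timeout 10 tcpdump -n -i eth0 vrrp 2>/dev/null | vrrp_senders
```

Two senders for the same vrid over a short capture suggest that both nodes consider themselves master.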

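Exercises and Verification

  1. Simulate a VIP failover
# Stop Keepalived on the master
systemctl stop keepalived
# Verify on the backup that the VIP is bound
ip a | grep -A2 "vip"
  2. Simulate an HAProxy failure that triggers a failover
# Stop haproxy and check whether Keepalived fails over
systemctl stop haproxy
journalctl -u keepalived -n 50 --no-pager
  3. Verify that GARP refreshes ARP caches
arping -I eth0 -c 3 -A 192.168.10.100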