13.7.7 Common Problems and Troubleshooting Approaches
Schematic: HAProxy + Keepalived high-availability failover flow
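0. Basic installation and log preparation (prerequisites for troubleshooting)
Goal: make sure the components are installed, logs are available, and the key kernel parameters are set, so that troubleshooting rests on a chain of evidence.
# Install (CentOS/RHEL)
yum install -y haproxy keepalived rsyslog
# Start now and enable at boot
systemctl enable --now haproxy keepalived rsyslog
# Check logs (Keepalived)
journalctl -u keepalived -n 100 --no-pager
# Check logs (HAProxy)
journalctl -u haproxy -n 100 --no-pager
Key kernel parameters (VIP failover/ARP)
# /etc/sysctl.d/99-keepalived.conf
# Let HAProxy bind to the VIP even while this node does not hold it
net.ipv4.ip_nonlocal_bind = 1
# Answer ARP only when the target IP is configured on the receiving interface
net.ipv4.conf.all.arp_ignore = 1
# Use the best local source address in outgoing ARP requests
net.ipv4.conf.all.arp_announce = 2
# Apply the settings
sysctl -p /etc/sysctl.d/99-keepalived.conf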
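After applying the file it can be useful to confirm that the running kernel values really match it. The following is a minimal sketch, not part of the original procedure: the function names `sysctl_key_to_path` and `verify_sysctl_file` are hypothetical helpers, and the optional second argument (an alternative root instead of `/proc/sys`) exists only to make the helper testable. It handles single-value keys only.

```bash
# Map a sysctl key such as net.ipv4.ip_nonlocal_bind to its path under /proc/sys.
# Keys whose interface name itself contains a dot are not handled by this sketch.
sysctl_key_to_path() {
    local key="$1" root="${2:-/proc/sys}"
    printf '%s/%s\n' "$root" "${key//.//}"
}

# Print one OK or MISMATCH line for every "key = value" entry in the file.
verify_sysctl_file() {
    local file="$1" root="${2:-/proc/sys}" key value actual
    while IFS='=' read -r key value || [[ -n "$key" ]]; do
        key="${key//[[:space:]]/}"
        value="${value//[[:space:]]/}"
        # Skip blank lines and comment lines
        [[ -z "$key" || "$key" == \#* ]] && continue
        actual="$(cat "$(sysctl_key_to_path "$key" "$root")" 2>/dev/null || echo missing)"
        if [[ "$actual" == "$value" ]]; then
            echo "OK       $key = $actual"
        else
            echo "MISMATCH $key: expected $value, running $actual"
        fi
    done < "$file"
}

# Example:
#   verify_sysctl_file /etc/sysctl.d/99-keepalived.conf
```

Any MISMATCH line suggests the file was edited but not applied, or that another sysctl file overrides it.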
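1. VIP does not fail over, or failover fails
Symptom: after the master node goes down, the VIP stays on the original master, or the new master does not bind the VIP.
Troubleshooting steps and examples
# 1) Check Keepalived status and logs
systemctl status keepalived
journalctl -u keepalived -n 200 --no-pager | tail -n 50
# 2) Verify whether the VIP is bound (matches the eth0:vip label)
ip a | grep -A2 "vip"
# 3) Check whether the firewall blocks VRRP (IP protocol number 112)
iptables -L -n | grep -Ew "vrrp|112"
firewall-cmd --list-all | grep -i vrrp
VRRP configuration checklist example (must be consistent on both nodes)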
# /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    state MASTER              # BACKUP on the backup node
    interface eth0
    virtual_router_id 51      # must be identical on both nodes
    priority 150              # 100 on the backup node
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111        # must be identical on both nodes
    }
    virtual_ipaddress {
        192.168.10.100/24 dev eth0 label eth0:vip
    }
}
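Comparing the two configuration files by eye is error-prone. As a hedged sketch, the helper below (the name `vrrp_shared_fields` is hypothetical) extracts only the fields that have to match on both nodes so that they can be diffed; the peer host name `node2` in the usage comment is likewise an assumption.

```bash
# Print the settings that must be identical on both nodes,
# one "name value" pair per line, in file order.
vrrp_shared_fields() {
    awk '
        { sub(/#.*/, "") }    # drop trailing comments
        $1 == "virtual_router_id" || $1 == "advert_int" ||
        $1 == "auth_type" || $1 == "auth_pass" { print $1, $2 }
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/ { print "vip", $1 }
    ' "$1"
}

# Example: an empty diff means the shared fields agree.
#   diff <(vrrp_shared_fields /etc/keepalived/keepalived.conf) \
#        <(ssh node2 cat /etc/keepalived/keepalived.conf | vrrp_shared_fields /dev/stdin)
```

Fields that legitimately differ between nodes, such as `state` and `priority`, are left out on purpose.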
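Expected result
- After Keepalived is stopped on the master, the backup binds the VIP within about 1 to 3 seconds (with advert_int 1; an unannounced failure such as a crash takes roughly 3.6 seconds to detect):
systemctl stop keepalived    # on the master
ip a | grep -A2 "vip"        # on the backup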
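How long the backup waits before declaring the master dead follows from the VRRP timers. The sketch below assumes VRRPv2 as defined in RFC 3768, where Skew_Time = (256 - priority) / 256 seconds and Master_Down_Interval = 3 * advert_int + Skew_Time; the function name `master_down_interval` is a hypothetical helper.

```bash
# Seconds a backup waits without advertisements before it takes over
# (VRRPv2 timers from RFC 3768).
master_down_interval() {
    local priority="$1" advert_int="$2"
    awk -v p="$priority" -v a="$advert_int" \
        'BEGIN { printf "%.3f\n", 3 * a + (256 - p) / 256 }'
}

# Example for the backup in the configuration above (priority 100, advert_int 1):
#   master_down_interval 100 1    # prints 3.609
```

A clean `systemctl stop keepalived` is normally faster than this, because the stopping master announces priority 0 and the backup then only waits for its skew time.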
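2. Dual-master (split-brain) problem
Symptom: both machines bind the VIP at the same time, and external access becomes erratic.
Troubleshooting steps and examples
# On both nodes, check whether the VIP is present at the same time
ip a | grep -A2 "vip"
# Check the heartbeat network for jitter or packet loss
ping -c 10 <peer_ip>
mtr -r -c 20 <peer_ip>
Keepalived anti-flapping configuration suggestions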
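vrrp_instance VI_1 {
    state BACKUP          # nopreempt only takes effect when the initial state is BACKUP
    priority 150
    nopreempt             # a recovered higher-priority node does not take the VIP back
    advert_int 1
    unicast_peer {        # unicast adverts, for networks that drop or filter VRRP multicast
        192.168.10.12
    }
}
Note: nopreempt stops a higher-priority node from preempting the current master when it comes back, which avoids repeated switchovers on an unstable network. It does not prevent split-brain by itself: split-brain arises when the nodes cannot receive each other's VRRP advertisements, so the heartbeat path and firewall rules have to be fixed first.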
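The "VIP on both nodes" check can be scripted so that it gives a single number. This is a minimal sketch under stated assumptions: the helper names `has_vip` and `count_vip_holders` are hypothetical, and the host names `node1` and `node2` in the usage comment stand for the two Keepalived nodes.

```bash
# Succeeds if the `ip -o -4 addr show` output on stdin lists the given VIP.
has_vip() {
    grep -Fq "inet $1/"
}

# Count how many nodes hold the VIP.
# Arguments: the VIP, then one file per node containing that node's
# `ip -o -4 addr show` output.
count_vip_holders() {
    local vip="$1" holders=0 f
    shift
    for f in "$@"; do
        if has_vip "$vip" < "$f"; then
            holders=$((holders + 1))
        fi
    done
    echo "$holders"
}

# Example:
#   ssh node1 ip -o -4 addr show > /tmp/node1.addr
#   ssh node2 ip -o -4 addr show > /tmp/node2.addr
#   count_vip_holders 192.168.10.100 /tmp/node1.addr /tmp/node2.addr
```

A result of 1 is the healthy case, 2 indicates split-brain, and 0 means no node currently holds the VIP.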
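3. Frequent switchover flapping
Symptom: master and backup switch roles frequently, with a visible impact on the service.
Troubleshooting steps and examples
# Inspect Keepalived state-change logs (look for frequent "Transition to MASTER")
journalctl -u keepalived | grep -E "Transition|Entering"
# Check how often the health script runs and which thresholds it uses
grep -n -A6 "vrrp_script" /etc/keepalived/keepalived.conf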
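To put a number on "frequent", the state changes in a time window can be counted. The sketch below assumes that Keepalived logs the phrase "Entering MASTER STATE" on each promotion, which may vary between versions; the function name `count_master_transitions` is a hypothetical helper.

```bash
# Count how often this node became MASTER in the log lines read from stdin.
count_master_transitions() {
    # grep -c exits non-zero when the count is 0, so neutralise that status
    grep -c 'Entering MASTER STATE' || true
}

# Example: promotions during the last hour
#   journalctl -u keepalived --since "1 hour ago" --no-pager | count_master_transitions
```

More than a handful of promotions per hour without a known outage usually points to an unstable heartbeat path or a health script that is too sensitive.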
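Script and threshold suggestions
vrrp_script chk_haproxy {
    script "/etc/keepalived/check_haproxy.sh"
    interval 3    # seconds between checks
    fall 3        # consecutive failures before the script counts as failed
    rise 2        # consecutive successes before it counts as recovered
    weight -60    # must exceed the master/backup priority gap (150 - 100 = 50)
}
# /etc/keepalived/check_haproxy.sh
#!/bin/bash
# Exit code: 0 = healthy, non-zero = unhealthy
pidof haproxy >/dev/null 2>&1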
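4. HAProxy tracking script has no effect
Symptom: HAProxy stops abnormally, but Keepalived does not trigger a switchover.
Troubleshooting steps and examples
# 1) Script permissions (the script must be executable)
ls -l /etc/keepalived/check_haproxy.sh
# 2) Run the check manually and inspect the exit code
/etc/keepalived/check_haproxy.sh; echo $?
# 3) Confirm that track_script is bound to the instance
grep -n "track_script" /etc/keepalived/keepalived.conf
Keepalived tracking configuration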
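vrrp_instance VI_1 {
    track_script {
        chk_haproxy
    }
}
Note: the script's negative weight must pull the MASTER's effective priority below the BACKUP's priority, otherwise no switchover happens. With priorities 150 and 100 the penalty has to be larger than 50; a weight of -30 leaves the MASTER at 120 and it keeps the VIP.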
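The priority arithmetic behind that note is easy to check before deploying a configuration. A minimal sketch, with the hypothetical helper name `predict_failover`; it covers only a single failed script with a negative weight and treats equal priorities as "no failover".

```bash
# Effective master priority after a tracked script with a negative weight fails,
# and whether the backup then outranks the master.
predict_failover() {
    local master_prio="$1" backup_prio="$2" weight="$3"
    local effective=$((master_prio + weight))
    if (( effective < backup_prio )); then
        echo "effective=$effective failover=yes"
    else
        echo "effective=$effective failover=no"
    fi
}

# Example with the priorities used in this section:
#   predict_failover 150 100 -30    # prints effective=120 failover=no
#   predict_failover 150 100 -60    # prints effective=90 failover=yes
```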
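5. Client access interruption or connection resets
Symptom: after a switchover, access is briefly interrupted or long-lived connections are dropped.
Troubleshooting steps and examples
# Check the advertisement interval, which bounds how quickly a failure is detected
grep -n "advert_int" /etc/keepalived/keepalived.conf
# Send gratuitous ARP from the node that now holds the VIP to speed up ARP cache updates
arping -I eth0 -c 3 -A 192.168.10.100
HAProxy graceful-offline suggestions
# /etc/haproxy/haproxy.cfg
defaults
    timeout client 60s
    timeout server 60s
    timeout connect 5s
# For long-lived connections, tune these timeouts to reduce disconnects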
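To see how long clients are actually cut off during a switchover, the VIP can be probed once per second while the failover is triggered. This is a rough sketch, not a benchmark: the helper names `probe_vip` and `longest_outage` are hypothetical, and the URL in the usage comment assumes that HAProxy serves HTTP on port 80 of the VIP, which the configuration above does not state.

```bash
# Probe a URL once per second and print "<epoch> ok" or "<epoch> fail" per attempt.
probe_vip() {
    local url="$1" count="$2" i
    for ((i = 0; i < count; i++)); do
        if curl -s -o /dev/null --max-time 1 "$url"; then
            echo "$(date +%s) ok"
        else
            echo "$(date +%s) fail"
        fi
        sleep 1
    done
}

# Longest run of consecutive failed probes in that output.
longest_outage() {
    awk '
        $2 == "fail" { run++; if (run > max) max = run; next }
        { run = 0 }
        END { print max + 0 }
    '
}

# Example: start this on a client, then stop Keepalived on the master.
#   probe_vip http://192.168.10.100/ 60 | tee /tmp/probe.log | longest_outage
```

Because each probe takes up to one second plus the one-second sleep, the result is a count of failed probes and only approximates the outage in seconds.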
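6. HAProxy health-check false positives
Symptom: a backend is healthy but is marked DOWN.
Troubleshooting steps and examples
# Look for the health-check failure reason in the HAProxy logs
journalctl -u haproxy | grep -i "health" | tail -n 50
# Reproduce the health-check request by hand (GET, like option httpchk) and print the status code
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.10:8080/health
Health-check configuration example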
backend app_backend
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server app1 10.0.0.10:8080 check inter 2s rise 3 fall 2
    server app2 10.0.0.11:8080 check inter 2s rise 3 fall 2
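Besides the logs, HAProxy's runtime statistics show which servers are DOWN and the last check result. The sketch below assumes that a stats socket is enabled, which the configuration in this section does not include: the path `/run/haproxy/admin.sock` and the use of `socat` are assumptions, and `down_servers` is a hypothetical helper. It locates CSV columns by header name rather than by position.

```bash
# List servers that HAProxy reports as DOWN, together with the check status.
# Reads the CSV produced by "show stat" on stdin.
down_servers() {
    awk -F, '
        NR == 1 {
            # Header line starts with "# "; remember each column index by name
            sub(/^# /, "", $1)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["svname"]) != "FRONTEND" && $(col["svname"]) != "BACKEND" &&
        $(col["status"]) ~ /^DOWN/ {
            print $(col["pxname"]) "/" $(col["svname"]), $(col["check_status"])
        }
    '
}

# Example (needs a matching "stats socket" line in the global section):
#   echo "show stat" | socat stdio /run/haproxy/admin.sock | down_servers
```

A check status that indicates a connection or timeout problem rather than an unexpected HTTP status suggests looking at the network path or `inter` timing before the application itself.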
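7. Incomplete or missing logs
Symptom: Keepalived or HAProxy logs are missing, which makes troubleshooting difficult.
Troubleshooting steps and examples
# rsyslog service
systemctl status rsyslog
# Check the HAProxy log target
grep -n "log" /etc/haproxy/haproxy.cfg
HAProxy logging example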
global
    log 127.0.0.1 local0
    maxconn 2000
defaults
    mode http
    option httplog
    log global
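rsyslog configuration example
# /etc/rsyslog.d/49-haproxy.conf
# HAProxy sends logs over UDP to 127.0.0.1:514, so rsyslog must listen there
# (skip the next two lines if UDP input is already enabled in /etc/rsyslog.conf)
module(load="imudp")
input(type="imudp" address="127.0.0.1" port="514")
local0.* /var/log/haproxy.log
# Restart both services, then watch the log
systemctl restart rsyslog haproxy
tail -f /var/log/haproxy.log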
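When the log file stays empty, it helps to test the rsyslog side separately from HAProxy. The sketch below first reads the target and facility out of the HAProxy configuration (`haproxy_log_targets` is a hypothetical helper), and the usage comment then sends a test message along the same path with `logger`; the marker string `haproxy-log-test` is arbitrary.

```bash
# Print "<target> <facility>" for every explicit log line in a haproxy.cfg.
haproxy_log_targets() {
    awk '$1 == "log" && $2 != "global" { print $2, $3 }' "$1"
}

# Example:
#   haproxy_log_targets /etc/haproxy/haproxy.cfg
#   logger -d -n 127.0.0.1 -P 514 -p local0.info "haproxy-log-test"
#   grep haproxy-log-test /var/log/haproxy.log
```

If the test message reaches the file but HAProxy's own lines do not, the problem is more likely in the HAProxy `log` directives than in rsyslog.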
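Overall troubleshooting flow (suggested order)
- Configuration consistency: are the VRRP ID, auth_pass, interface name, and VIP subnet consistent?
- Network connectivity: is the heartbeat network healthy, and does VRRP (protocol 112) get through?
- Processes and scripts: Keepalived/HAProxy process state and script exit codes
- Logs and packet capture: journalctl, tcpdump -i eth0 vrrp
- Failure drills: stop the service on the master to verify failover and failback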
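Exercises and verification
- Simulate VIP failover
# On the master, stop Keepalived
systemctl stop keepalived
# On the backup, verify that the VIP is bound
ip a | grep -A2 "vip"
- Simulate an HAProxy failure that triggers a switchover
# Stop haproxy, then check whether Keepalived switches over
systemctl stop haproxy
journalctl -u keepalived -n 50 --no-pager
- Verify that GARP refreshes ARP caches
arping -I eth0 -c 3 -A 192.168.10.100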