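13.5.6 Common Problems and Troubleshooting Approach

Troubleshoot common problems along four lines in parallel: symptom, metrics, configuration, and backend. Use the logs together with the stats page to locate the cause quickly.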

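1. Backend wrongly marked as down

Symptom: the backend is removed from rotation frequently, and the log shows Server ... is DOWN.
Checks and commands

# 1) From the HAProxy node, verify the health check URL and port
curl -sS -o /dev/null -w "%{http_code}\n" http://10.0.0.21:8080/healthz

# 2) Check that the backend port is reachable
nc -vz 10.0.0.21 8080

# 3) Look up the reason for DOWN in the log
grep -E "is DOWN|health check" /var/log/haproxy.log | tail -n 20

Configuration to verify (/etc/haproxy/haproxy.cfg)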

backend app_be
    option httpchk GET /healthz
    http-check expect status 200
    default-server inter 2000 fall 3 rise 2
    server app1 10.0.0.21:8080 check
    server app2 10.0.0.22:8080 check

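Fix: relax inter/fall/rise; make sure the health check path exists and returns the expected response; allow health check traffic through any firewall between HAProxy and the backend.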
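
To see how inter, fall, and rise translate into reaction time, the arithmetic can be sketched in shell. This is only a rough approximation: it assumes checks are spaced exactly inter milliseconds apart and ignores check timeouts and any fastinter/downinter settings. The function names down_after_ms and up_after_ms are made up for this illustration.

```shell
# Approximate time (ms) before a failing server is marked DOWN:
# `fall` consecutive failed checks, spaced `inter` ms apart.
down_after_ms() {
    inter_ms=$1
    fall=$2
    echo $(( inter_ms * fall ))
}

# Approximate time (ms) before a recovered server is marked UP again:
# `rise` consecutive successful checks, spaced `inter` ms apart.
up_after_ms() {
    inter_ms=$1
    rise=$2
    echo $(( inter_ms * rise ))
}

# Values from the app_be example: inter 2000, fall 3, rise 2
down_after_ms 2000 3   # prints 6000
up_after_ms 2000 2     # prints 4000
```

Under these assumptions, raising fall from 3 to 5 stretches the window before a server is marked DOWN from about 6 s to about 10 s. That tolerates longer hiccups but slows down failover.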

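2. Health checks pass but real requests fail

Symptom: checks pass, but real requests return many 5xx errors or time out.
Checks and commands

# 1) Reproduce a request on the business path
curl -sS -o /dev/null -w "%{http_code}\n" http://10.0.0.21:8080/api/orders

# 2) Query the HAProxy stats page (if enabled); the ;csv suffix returns CSV instead of HTML
curl -s "http://127.0.0.1:8404/stats;csv"

Enhanced configuration example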

backend app_be
    option httpchk GET /healthz
    http-check expect rstatus ^2..$
    # Stricter validation against a business-critical path (http-check send needs HAProxy 2.2 or later)
    http-check send meth GET uri /api/ping ver HTTP/1.1 hdr Host app.example.com
    http-check expect string "OK"
    default-server maxconn 200 slowstart 10s

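Fix: validate the response content with http-check expect; size maxconn to what the backend can actually sustain and use slowstart to ramp traffic up gradually after recovery; add rate limiting and circuit breaking on the backend.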

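3. Frequent flapping (UP/DOWN oscillation)

Symptom: the backend state flips frequently in the log.
Checks and commands

# 1) Measure network latency and jitter
ping -c 20 10.0.0.21

# 2) Check whether DNS is unstable (if a hostname is used)
dig +short app-backend.example.com

Mitigation configuration example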

backend app_be
    option httpchk GET /healthz
    # observe is a server option, so it goes on default-server or a server line
    default-server inter 2000 fall 5 rise 3 observe layer7

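Fix: increase rise/fall so that short transient failures do not flip the state; use observe layer7 so that errors seen on live traffic also feed into the health evaluation; find the backend resource bottleneck and scale out.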
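
To quantify flapping before and after a change, one option is to count UP and DOWN transitions per server in the log. The sketch below assumes the usual "Server <backend>/<server> is UP|DOWN, reason: ..." message wording; count_transitions is a hypothetical helper name, not an HAProxy tool.

```shell
# Read HAProxy log lines on stdin and print "<backend>/<server> <STATE> <count>".
count_transitions() {
    awk '
        {
            # Look for the word sequence: Server <name> is <STATE>,
            for (i = 1; i + 3 <= NF; i++) {
                if ($i == "Server" && $(i + 2) == "is") {
                    state = $(i + 3)
                    sub(/,$/, "", state)          # strip the trailing comma
                    if (state == "UP" || state == "DOWN") {
                        count[$(i + 1) " " state]++
                    }
                }
            }
        }
        END {
            for (key in count) print key, count[key]
        }
    ' | sort
}

# Usage (log path as used earlier in this section):
#   count_transitions < /var/log/haproxy.log
```

A server whose DOWN count keeps growing after fall has been raised is more likely suffering from a real backend or network problem than from over-sensitive checks.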

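4. Failover does not happen or is slow

Symptom: traffic still reaches the failed node after the failure.
Checks and commands

# 1) Simulate a backend outage
sudo systemctl stop app-backend

# 2) Watch the state change in stats (CSV output is easier to grep than HTML)
watch -n 1 'curl -s "http://127.0.0.1:8404/stats;csv" | grep app1'

Recommended configuration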

backend app_be
    option redispatch
    retries 2
    timeout check 2s
    timeout server 5s

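Fix: tune timeout check; review the session persistence policy, since sticky sessions can keep sending clients to a failed node; set sensible retries together with option redispatch so that a retried request may go to another server.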
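
To watch only the per-server status during a failover test, the CSV form of the stats page is easier to filter than the HTML page. The sketch below assumes the stats listener used in this section (127.0.0.1:8404, URI /stats) is reachable without authentication. Columns are looked up by header name rather than by position; server_status is a hypothetical helper name.

```shell
# Read HAProxy CSV stats on stdin and print "<backend>/<server> <status>"
# for real servers only (FRONTEND and BACKEND summary rows are skipped).
server_status() {
    awk -F, '
        NF < 2 { next }                      # skip blank lines
        !seen_header {
            sub(/^# /, "", $1)               # header line starts with "# "
            for (i = 1; i <= NF; i++) col[$i] = i
            seen_header = 1
            next
        }
        $(col["svname"]) != "FRONTEND" && $(col["svname"]) != "BACKEND" {
            print $(col["pxname"]) "/" $(col["svname"]), $(col["status"])
        }
    '
}

# Usage:
#   curl -s "http://127.0.0.1:8404/stats;csv" | server_status
```

Running this under watch while stopping app-backend shows roughly how long app1 takes to leave the UP state.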

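5. Backend cannot rejoin after recovery

Symptom: the backend has recovered but still shows MAINT or DOWN.
Checks and commands

# 1) Check whether it is flagged for maintenance (the command takes a backend name and lists its servers)
echo "show servers state app_be" | socat stdio /run/haproxy/admin.sock

# 2) Leave maintenance mode (requires the admin socket)
echo "set server app_be/app1 state ready" | socat stdio /run/haproxy/admin.sock

Configuration to check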

global
    stats socket /run/haproxy/admin.sock mode 600 level admin

Fix: clear the maintenance flag; adjust rise and the check path; confirm that the services the backend depends on are available.

6. Stats page and logs lack key information

Symptom: the cause cannot be located quickly.
Checks and configuration

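global
    log /dev/log local0
defaults
    # Without "log global" the proxies do not use the log target defined in global
    log global
    mode http
    option httplog
    # The custom log-format below overrides the format set by option httplog
    log-format "%ci:%cp [%t] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq"
    timeout connect 5s
    timeout client 30s
    timeout server 30s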

listen stats
    bind 0.0.0.0:8404
    stats enable
    stats uri /stats
    stats refresh 5s
    stats auth admin:Admin@123

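Verification command

# The password contains "@", so pass the credentials with -u instead of embedding them in the URL
curl -s -u 'admin:Admin@123' http://127.0.0.1:8404/stats | head -n 5

Fix: extend the log format so that it records the backend and server, the termination state, and the retry count; enable the stats page and restrict access to it.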

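7. SSL or SNI causes check failures

Symptom: health checks against an HTTPS backend fail.
Checks and commands

# 1) Test backend HTTPS directly and inspect the certificate chain
curl -vk https://10.0.0.31:8443/healthz

# 2) Verify SNI
openssl s_client -connect 10.0.0.31:8443 -servername app.example.com </dev/null

Configuration example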

backend https_be
    option httpchk GET /healthz
    # "check" is required; without it no health check runs and check-sni has no effect
    server app1 10.0.0.31:8443 check ssl verify none check-sni app.example.com

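Fix: ssl verify none is acceptable in test environments; in production, import the CA and enable verification (verify required with ca-file); set check-sni.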

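8. Suggested troubleshooting flow (reproducible)

Steps and commands

# 1) Check the state in stats first (CSV output is easier to grep than HTML)
curl -s "http://127.0.0.1:8404/stats;csv" | grep -E "app1|app2"

# 2) Find UP/DOWN transitions in the log
tail -n 50 /var/log/haproxy.log | grep -E "is (UP|DOWN)"

# 3) Verify the check path
curl -sS -o /dev/null -w "%{http_code}\n" http://10.0.0.21:8080/healthz

# 4) Validate the configuration and reload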
haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy

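Exercises and verification

Change inter from 2000 to 1000, simulate flapping, and observe how the UP/DOWN frequency in the log changes.
Configure http-check expect string "OK", make the backend return ERROR, and observe the health check status on the stats page (the LastChk column).
Use the admin socket to put app1 into maintenance and then back to ready, and observe how the traffic distribution changes.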