7.7.2 Connection and Request Errors: Timeouts, 502/504, and Upstream Failures

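This section covers Nginx reverse-proxy deployments. For timeouts, 502/504 responses, and upstream failures it provides an actionable troubleshooting workflow, configuration examples, verification commands, and exercises, closing the loop from observation through diagnosis and repair to regression testing.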

I. Sketch of the Mechanics: Request Path and Timeout Trigger Points

II. Common Symptoms and What the Status Codes Mean (with Examples)

502 Bad Gateway: the upstream is unreachable or its response violates the protocol
Typical error.log entry:

2024/05/20 10:01:02 [error] 1234#1234: *88 connect() failed (111: Connection refused) while connecting to upstream, client: 10.0.0.1, server: api.example.com, request: "GET /v1/list HTTP/1.1", upstream: "http://10.0.0.5:8080/v1/list", host: "api.example.com"

504 Gateway Timeout: Nginx timed out waiting for the upstream response
Typical error.log entry:

2024/05/20 10:05:12 [error] 1234#1234: *102 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.0.2, server: api.example.com, request: "GET /v1/report HTTP/1.1", upstream: "http://10.0.0.6:8080/v1/report", host: "api.example.com"

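499: the client closed the connection before Nginx sent a response (an Nginx-specific, non-standard status code)
Typical access-log entry (requires a custom log_format):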

10.0.0.3 - - [20/May/2024:10:08:20 +0800] "GET /v1/slow HTTP/1.1" 499 0 "-" "curl/8.0" rt=30.001 urt=0.000 ustatus=-
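The error.log signatures above can be told apart mechanically. The following is a minimal sketch rather than an official tool: `classify_upstream_error` is a hypothetical helper name, and the status it prints is the one Nginx normally returns for that message (connection refused gives 502, an upstream timeout gives 504). The demo input is a shortened copy of the two error.log samples above.

```shell
# classify_upstream_error: hypothetical helper. Reads error.log lines on stdin
# and prints "<likely status> <reason>" for each upstream-related error.
classify_upstream_error() {
  awk '
    /connect\(\) failed \(111: Connection refused\)/ { print "502 connection-refused"; next }
    /upstream timed out/ && /while connecting to upstream/ { print "504 connect-timeout"; next }
    /upstream timed out/ && /while reading response header from upstream/ { print "504 read-timeout"; next }
    /upstream/ { print "unknown other-upstream-error" }
  '
}

# Demo: shortened copies of the error.log samples shown above.
printf '%s\n' \
  '2024/05/20 10:01:02 [error] 1234#1234: *88 connect() failed (111: Connection refused) while connecting to upstream' \
  '2024/05/20 10:05:12 [error] 1234#1234: *102 upstream timed out (110: Connection timed out) while reading response header from upstream' \
  | classify_upstream_error
# prints:
# 502 connection-refused
# 504 read-timeout
```

On a real host the same function can be fed with `tail -n 500 /var/log/nginx/error.log | classify_upstream_error | sort | uniq -c`.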

III. Logs and Metrics: Establish an Observability Baseline First

1) Configure an observable log format (/etc/nginx/nginx.conf)

http {
    log_format main_ext '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent" '
                        'rt=$request_time urt=$upstream_response_time '
                        'ustatus=$upstream_status uaddr=$upstream_addr';

    access_log /var/log/nginx/access.log main_ext;
    error_log  /var/log/nginx/error.log warn;
}

2) Validate the configuration and reload

nginx -t
nginx -s reload

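3) Quickly filter slow requests and upstream errors

# Find 502/504/499 responses in the last 200 lines
# (anchored on the closing quote of "$request" so byte counts such as 5021 do not match)
tail -n 200 /var/log/nginx/access.log | grep -E '" (502|504|499) '

# Find requests whose upstream response time exceeds 3 seconds
# (portable awk: the three-argument form of match() is a gawk extension)
awk 'match($0, /urt=[0-9.]+/) {
  urt = substr($0, RSTART + 4, RLENGTH - 4) + 0
  if (urt > 3) print $0
}' /var/log/nginx/access.log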
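Once the extended log format is in place, a per-upstream breakdown shows quickly whether errors cluster on one backend. The sketch below is an illustration under assumptions: `upstream_summary` is a hypothetical helper, the input uses the main_ext format defined above, the request line contains no embedded spaces (so the status is field 9), and the demo lines are made-up samples.

```shell
# upstream_summary: hypothetical helper. Reads main_ext-format access-log
# lines on stdin and prints request and 5xx counts per upstream address.
upstream_summary() {
  awk '{
    status = $9 + 0                      # status follows the quoted request line
    addr = "-"
    i = index($0, "uaddr=")
    if (i > 0) addr = substr($0, i + 6)  # text after uaddr= (a comma list after retries)
    total[addr]++
    if (status >= 500) bad[addr]++
  }
  END {
    for (a in total) printf "%s total=%d 5xx=%d\n", a, total[a], bad[a] + 0
  }' | sort
}

# Demo on made-up sample lines in the main_ext format.
printf '%s\n' \
  '10.0.0.1 - - [20/May/2024:10:01:02 +0800] "GET /v1/list HTTP/1.1" 502 157 "-" "curl/8.0" rt=0.002 urt=0.001 ustatus=502 uaddr=10.0.0.5:8080' \
  '10.0.0.1 - - [20/May/2024:10:01:03 +0800] "GET /v1/list HTTP/1.1" 200 512 "-" "curl/8.0" rt=0.020 urt=0.019 ustatus=200 uaddr=10.0.0.5:8080' \
  '10.0.0.2 - - [20/May/2024:10:05:12 +0800] "GET /v1/report HTTP/1.1" 504 160 "-" "curl/8.0" rt=30.001 urt=30.001 ustatus=504 uaddr=10.0.0.6:8080' \
  | upstream_summary
# prints:
# 10.0.0.5:8080 total=2 5xx=1
# 10.0.0.6:8080 total=1 5xx=1
```

Against a live log, `upstream_summary < /var/log/nginx/access.log` gives the same table for all recorded requests.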

IV. Configuration and a Consistent Timeout Chain (with Complete Examples)

1) Reverse-proxy configuration example (/etc/nginx/conf.d/api.conf)

upstream api_backend {
    server 10.0.0.5:8080 max_fails=3 fail_timeout=10s;
    server 10.0.0.6:8080 max_fails=3 fail_timeout=10s;
    keepalive 64;
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_connect_timeout 3s;
        proxy_send_timeout    30s;
        proxy_read_timeout    30s;

        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;

        proxy_pass http://api_backend;
    }
}

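Expected behavior: connecting to an upstream times out after 3 s, and waiting for response data times out after 30 s (proxy_read_timeout applies between two successive reads, not to the whole response). On a connection error, a timeout, or a 502/503/504 from the upstream, Nginx retries once on the next server, two attempts in total. Non-idempotent requests such as POST are not retried by default once they have been sent to an upstream.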
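Retries multiply the time a client may have to wait, which matters when choosing the client-side timeout. The arithmetic can be sketched as below. `worst_case_wait` is a hypothetical helper, and the result is only an approximate bound: it assumes each attempt uses its full connect timeout plus one full read timeout, and it ignores proxy_send_timeout and the proxy_next_upstream_timeout directive (unlimited by default), which can cap the total.

```shell
# worst_case_wait: hypothetical helper. Prints an approximate upper bound, in
# seconds, on the wait when every attempt runs into its timeouts.
# Arguments: number of tries, connect timeout (s), read timeout (s).
worst_case_wait() {
  tries=$1
  connect=$2
  read_to=$3
  echo $(( tries * (connect + read_to) ))
}

# Values from the example config: proxy_next_upstream_tries 2,
# proxy_connect_timeout 3s, proxy_read_timeout 30s.
worst_case_wait 2 3 30   # prints 66
```

Under these assumptions a client with a 30 s timeout would give up (and be logged as 499) well before Nginx has exhausted both attempts.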

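2) Aligning the upstream application's timeouts (Java / Spring Boot example)

# application.properties
# Note: server.connection-timeout was deprecated in Spring Boot 2.3;
# on newer versions with embedded Tomcat use server.tomcat.connection-timeout instead.
server.connection-timeout=3000
spring.mvc.async.request-timeout=30000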

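V. Troubleshooting Command Checklist (with Explanations and What to Look For)

# 1) Check connectivity to the upstream port
nc -zv 10.0.0.5 8080
# Success prints "succeeded"; failure prints "Connection refused" or a timeout

# 2) Inspect connection states and TIME_WAIT counts
ss -antp | head -n 20
ss -s

# 3) Verify the request path through the reverse proxy
curl -v http://api.example.com/v1/health

# 4) Query the upstream directly for comparison
curl -v http://10.0.0.5:8080/v1/health

# 5) Capture packets while a timeout occurs (requires root; replace eth0 with the actual interface)
tcpdump -i eth0 host 10.0.0.5 and port 8080 -nn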

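VI. Typical Failure Scenarios and Fixes (with Examples)

Scenario 1: 502s surge at peak load (connection refused)
Symptom: error.log fills with connect() failed (111: Connection refused)
Cause: the upstream process is not accepting new connections, for example because it has crashed, is restarting, or its connection pool is exhausted
Steps:

# On the upstream host: check the listening port and the process state
ss -lntp | grep 8080
ps -ef | grep your-app

# On the Nginx side: enable keepalive so upstream connections are reused and fewer new ones are opened
# (this also needs proxy_http_version 1.1 and an empty Connection header, as in the reverse-proxy example above)
# /etc/nginx/conf.d/api.conf
# upstream api_backend { keepalive 64; }

nginx -t && nginx -s reload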

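Scenario 2: intermittent 504s (slow upstream)
Symptom: upstream timed out while reading response header
Cause: slow queries, or a dependency behind the upstream that responds slowly
Steps:

# Inspect slow requests (upstream response time above 10 seconds)
# (portable awk: the three-argument form of match() is a gawk extension)
awk 'match($0, /urt=[0-9.]+/) {
  urt = substr($0, RSTART + 4, RLENGTH - 4) + 0
  if (urt > 10) print $0
}' /var/log/nginx/access.log

# Temporarily raise the read timeout to confirm the diagnosis
# /etc/nginx/conf.d/api.conf
# proxy_read_timeout 60s;

nginx -t && nginx -s reload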
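Raising the timeout only confirms the diagnosis; to find which endpoints are actually slow, the urt values can be aggregated per path. This is a rough sketch under assumptions: `slow_uris` is a hypothetical helper, the log uses the main_ext format, the URI is field 7, and the demo lines are made-up samples.

```shell
# slow_uris: hypothetical helper. Reads main_ext-format lines on stdin and
# prints "avg_urt max_urt count path", slowest average first.
slow_uris() {
  awk 'match($0, /urt=[0-9.]+/) {
    urt = substr($0, RSTART + 4, RLENGTH - 4) + 0
    uri = $7
    sub(/\?.*/, "", uri)                 # group by path, ignoring the query string
    n[uri]++
    sum[uri] += urt
    if (urt > max[uri]) max[uri] = urt
  }
  END {
    for (u in n) printf "%.3f %.3f %d %s\n", sum[u] / n[u], max[u], n[u], u
  }' | sort -rn
}

# Demo on made-up sample lines in the main_ext format.
printf '%s\n' \
  '10.0.0.2 - - [20/May/2024:10:05:12 +0800] "GET /v1/report?id=1 HTTP/1.1" 200 900 "-" "curl/8.0" rt=12.001 urt=12.000 ustatus=200 uaddr=10.0.0.6:8080' \
  '10.0.0.2 - - [20/May/2024:10:05:30 +0800] "GET /v1/report?id=2 HTTP/1.1" 200 900 "-" "curl/8.0" rt=8.001 urt=8.000 ustatus=200 uaddr=10.0.0.6:8080' \
  '10.0.0.1 - - [20/May/2024:10:05:31 +0800] "GET /v1/list HTTP/1.1" 200 512 "-" "curl/8.0" rt=0.101 urt=0.100 ustatus=200 uaddr=10.0.0.5:8080' \
  | slow_uris
# prints:
# 10.000 12.000 2 /v1/report
# 0.100 0.100 1 /v1/list
```

Lines where urt is "-" (no upstream was contacted) are skipped, and for retried requests only the first value in the comma-separated list is used.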

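Scenario 3: many 499s (clients disconnect early)
Symptom: the number of 499 entries in the access log rises
Cause: the client timeout is too short, or the front end cancels the request
Steps:

# Find the endpoints that return 499
awk '$9==499 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head

# Agree on one timeout value (for example 30s) with the front end and other callers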
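Before negotiating a value with callers, it helps to know what timeout they use today. One heuristic, sketched here under assumptions, is to bucket the request time of 499 entries: a spike at one value suggests a client-side timeout of about that length. `client_timeout_hint` is a hypothetical helper and expects the main_ext format with its rt= field.

```shell
# client_timeout_hint: hypothetical helper. Reads main_ext-format lines on
# stdin and prints "<count> <seconds>s" for 499 responses, most frequent first.
client_timeout_hint() {
  awk '$9 == 499 && match($0, / rt=[0-9.]+/) {
    rt = substr($0, RSTART + 4, RLENGTH - 4) + 0   # skip the leading " rt="
    bucket[int(rt + 0.5)]++                        # round to the nearest second
  }
  END { for (s in bucket) printf "%d %ds\n", bucket[s], s }' | sort -rn
}

# Demo: the 499 sample from this section.
printf '%s\n' \
  '10.0.0.3 - - [20/May/2024:10:08:20 +0800] "GET /v1/slow HTTP/1.1" 499 0 "-" "curl/8.0" rt=30.001 urt=0.000 ustatus=-' \
  | client_timeout_hint
# prints: 1 30s
```

The leading space in the pattern keeps it from matching the urt= field.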

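VII. Fault Simulation and Verification Exercises

Exercise 1: simulate a 502 (upstream unreachable)

# Stop the upstream service (example). With the two-server upstream above,
# stop it on both servers; otherwise Nginx fails over to the healthy one.
systemctl stop your-app

# Request through Nginx
curl -v http://api.example.com/v1/health
# Expected: a 502 response, and connect() failed in error.log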

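Exercise 2: simulate a 504 (upstream responds too slowly)

# Add a delay endpoint to the upstream (pseudocode)
# /health?delay=40 makes the handler sleep for 40 seconds

curl -v "http://api.example.com/health?delay=40"
# Expected: each attempt times out after 30s. With the retry settings above
# (proxy_next_upstream ... timeout, 2 tries) the 504 arrives after about 60s
# if both upstream servers are slow.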

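Exercise 3: verify that the timeout chain is aligned

# Set the client, Nginx, and upstream timeouts to 30s
# A request to an endpoint that delays 25s should then succeed
curl -v --max-time 30 "http://api.example.com/health?delay=25"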
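The alignment rule behind this exercise can be written down as a small check. This is a sketch of one reasonable reading, not a rule stated by Nginx itself: an outer layer should not time out before an inner one, so equal values (as in the exercise) pass. `check_timeout_chain` is a hypothetical helper that takes whole seconds.

```shell
# check_timeout_chain: hypothetical helper. Arguments are the client, Nginx
# (proxy_read_timeout), and upstream application timeouts in whole seconds.
# Prints a warning for each outer layer that gives up before the inner one.
check_timeout_chain() {
  client=$1
  nginx=$2
  upstream=$3
  rc=0
  if [ "$client" -lt "$nginx" ]; then
    echo "client (${client}s) < nginx (${nginx}s): expect 499 before Nginx gives up"
    rc=1
  fi
  if [ "$nginx" -lt "$upstream" ]; then
    echo "nginx (${nginx}s) < upstream (${upstream}s): expect 504 while the upstream is still working"
    rc=1
  fi
  if [ "$rc" -eq 0 ]; then
    echo "aligned: client >= nginx >= upstream"
  fi
  return "$rc"
}

check_timeout_chain 30 30 30   # prints "aligned: client >= nginx >= upstream"
```

A call such as `check_timeout_chain 10 30 30` would instead report the 499 risk and return a non-zero status, which makes the function usable in a deployment check.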

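VIII. Section Summary (Reusable Troubleshooting Checklist)

1) Establish log observability first (status code, latency, upstream address).
2) Use curl to compare the result through Nginx with a direct request to the upstream.
3) Confirm that timeouts are consistent along the chain (client / Nginx / upstream).
4) Verify connection state with ss, nc, and tcpdump.
5) Adjust timeouts, connection pools, and keepalive, then re-run the checks to confirm the fix.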