7.6.5 Monitoring and Alerting Integration (Prometheus/Exporter)

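Adding Prometheus to Nginx log monitoring lets exporters expose access metrics, error counts, and connection state as scrapeable time series, which an alerting system can then use for proactive detection and fast response. This section follows two tracks, "Nginx Stub Status + nginx-prometheus-exporter" and "access-log parsing + nginxlog-exporter", and covers installation, configuration, troubleshooting, and drills.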

[Figure: schematic of the data flow]

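Options and use cases
- Stub Status + nginx-prometheus-exporter: lightweight; an overview of connections and request counts
- access-log parsing + nginxlog-exporter: business metrics and per-path statistics
- Nginx Plus API (if available): finer-grained metrics (commercial edition only, not the open-source build)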


I. Stub Status + nginx-prometheus-exporter

1) Enable Nginx Stub Status

# 1. Enable stub_status (Nginx config snippet; requires ngx_http_stub_status_module)
cat >/etc/nginx/conf.d/stub_status.conf <<'EOF'
server {
    listen 127.0.0.1:8080;
    location /stub_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}
EOF

# 2. Check the syntax and reload
nginx -t && systemctl reload nginx

# 3. Expected result
curl -s http://127.0.0.1:8080/stub_status
# Active connections: 1
# server accepts handled requests
#  10 10 20
# Reading: 0 Writing: 1 Waiting: 0

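Key settings
- allow/deny: restrict who can read the status page so it is not exposed to the public internet
- listen 127.0.0.1: accept connections from the local host only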
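To make the stub_status fields concrete, here is a minimal sketch that parses the plain-text page into integers, which is roughly what an exporter has to do before it can expose metrics. The function name `parse_stub_status` and the derived `dropped` field are illustrative assumptions, not part of any exporter's API; the sample text is the output shown above.

```python
import re


def parse_stub_status(text: str) -> dict:
    """Parse the plain-text stub_status page into a dict of integers."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The counters line holds three numbers: accepts, handled, requests.
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text)
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = map(int, totals.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
        # Accepted but never handled connections (hypothetical derived field).
        "dropped": accepts - handled,
    }


SAMPLE = """Active connections: 1
server accepts handled requests
 10 10 20
Reading: 0 Writing: 1 Waiting: 0
"""

status = parse_stub_status(SAMPLE)
print(status)
```

accepts, handled, and requests are cumulative counters, so dashboards normally plot their rate of change; active, reading, writing, and waiting are point-in-time gauges.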

2) Install nginx-prometheus-exporter

# Install from the release binary (replace the example version as needed)
useradd -r -s /sbin/nologin nginxexp
cd /usr/local/src
curl -LO https://github.com/nginxinc/nginx-prometheus-exporter/releases/download/v1.1.0/nginx-prometheus-exporter_1.1.0_linux_amd64.tar.gz
tar -xf nginx-prometheus-exporter_1.1.0_linux_amd64.tar.gz
install -m 755 nginx-prometheus-exporter /usr/local/bin/

# systemd unit (exporter 1.x takes long flags with a double dash)
cat >/etc/systemd/system/nginx-exporter.service <<'EOF'
[Unit]
Description=NGINX Prometheus Exporter
After=network.target

[Service]
User=nginxexp
ExecStart=/usr/local/bin/nginx-prometheus-exporter \
  --nginx.scrape-uri=http://127.0.0.1:8080/stub_status
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now nginx-exporter

# Expected result: Prometheus text-format metrics (nginx_up is 1 when the scrape of stub_status succeeds)
curl -s http://127.0.0.1:9113/metrics | head

3) Prometheus scrape configuration

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['127.0.0.1:9113']
        labels:
          env: prod
          app: nginx

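Field notes
- job_name: the scrape job name; it becomes the job label used for aggregation and alerting
- targets: exporter address(es)
- labels: extra labels for grouping and for telling environments apart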


II. Access-Log Parsing + nginxlog-exporter

1) Nginx log format (including upstream latency and status)

# /etc/nginx/nginx.conf
log_format main_json escape=json
'{"time":"$time_iso8601","remote_addr":"$remote_addr","host":"$host",'
'"request":"$request","status":$status,"body_bytes_sent":$body_bytes_sent,'
'"request_time":$request_time,"upstream_time":"$upstream_response_time",'
'"upstream_status":"$upstream_status","uri":"$uri"}';

access_log /var/log/nginx/access.log main_json;

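Key fields
- request_time: total time spent on the request, in seconds
- upstream_time (from $upstream_response_time): time spent waiting for the backend, useful for locating slow requests; the value is "-" when no upstream was contacted, and it lists one time per attempt when several upstreams were tried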
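As a rough illustration of how these two fields combine when hunting slow requests, the sketch below parses one JSON log line in the format defined above and subtracts the upstream time from the total. The helper names `upstream_seconds` and `nginx_overhead` are hypothetical, and the log line is made-up sample data.

```python
import json
import re
from typing import Optional


def upstream_seconds(raw: str) -> Optional[float]:
    """Sum the upstream times; nginx separates attempts with commas or colons."""
    parts = [p.strip() for p in re.split(r"[,:]", raw)]
    values = [float(p) for p in parts if p and p != "-"]
    return sum(values) if values else None


def nginx_overhead(line: str) -> Optional[float]:
    """Seconds of request_time not spent waiting on the backend."""
    entry = json.loads(line)
    upstream = upstream_seconds(entry["upstream_time"])
    if upstream is None:
        return None  # served without an upstream (static file, cache, error page)
    return round(entry["request_time"] - upstream, 3)


# Made-up sample line: the first upstream attempt failed, the second succeeded.
LINE = (
    '{"time":"2024-01-01T00:00:00+00:00","remote_addr":"127.0.0.1",'
    '"host":"example.com","request":"GET /api/v1/1/user/2 HTTP/1.1",'
    '"status":200,"body_bytes_sent":512,"request_time":0.250,'
    '"upstream_time":"0.100, 0.120","upstream_status":"502, 200",'
    '"uri":"/api/v1/1/user/2"}'
)

overhead = nginx_overhead(LINE)
print(overhead)
```

A large gap between request_time and the summed upstream time usually points at the client side (slow uploads or downloads) rather than the backend.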

2) Install nginxlog-exporter

useradd -r -s /sbin/nologin nginxlogexp
cd /usr/local/src
curl -LO https://github.com/martin-helmich/prometheus-nginxlog-exporter/releases/download/v1.10.0/prometheus-nginxlog-exporter_1.10.0_linux_amd64.tar.gz
tar -xf prometheus-nginxlog-exporter_1.10.0_linux_amd64.tar.gz
install -m 755 prometheus-nginxlog-exporter /usr/local/bin/

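# Config file
cat >/etc/nginxlog-exporter.yml <<'EOF'
listen:
  port: 4040
  # Loopback only; the Prometheus job in this section scrapes 127.0.0.1:4040
  address: "127.0.0.1"

namespaces:
  - name: "nginx_access"
    parser: "json"
    source:
      files:
        - /var/log/nginx/access.log
    labels:
      app: "nginx"
    histogram_buckets: [0.01,0.05,0.1,0.3,0.5,1,2,5]
    relabel_configs:
      - target_label: "uri"
        from: "uri"
        matches:
          - regexp: "/api/v1/\\d+/user/\\d+"
            replacement: "/api/v1/{id}/user/{id}"
EOF

# systemd unit
cat >/etc/systemd/system/nginxlog-exporter.service <<'EOF'
[Unit]
Description=Nginx Log Exporter
After=network.target

[Service]
User=nginxlogexp
ExecStart=/usr/local/bin/prometheus-nginxlog-exporter \
  -config-file /etc/nginxlog-exporter.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# The nginxlogexp user must be able to read /var/log/nginx/access.log
# (file mode and owning group vary by distribution)
systemctl daemon-reload
systemctl enable --now nginxlog-exporter

# Expected result
curl -s http://127.0.0.1:4040/metrics | grep nginx_access | head

Parameter notes
- parser: json: parse each log line as JSON (the format key is reserved for log_format-style text patterns)
- histogram_buckets: bucket boundaries, in seconds, for the request-time histograms
- relabel_configs: normalizes the URI into a uri label so that per-ID paths do not create high-cardinality series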
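3) Prometheus scrape configuration

# /etc/prometheus/prometheus.yml (merge this job into the file's single scrape_configs list; a YAML key must not appear twice)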
scrape_configs:
  - job_name: 'nginxlog'
    static_configs:
      - targets: ['127.0.0.1:4040']
        labels:
          env: prod
          app: nginx

III. Example Alerting Rules (Prerequisite for Alertmanager Integration)

# /etc/prometheus/rules/nginx.yml
groups:
- name: nginx.rules
  rules:
  - alert: NginxHigh5xxRate
    expr: |
      sum(rate(nginx_access_http_response_count_total{status=~"5.."}[5m]))
      /
      sum(rate(nginx_access_http_response_count_total[5m])) > 0.02
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Nginx 5xx ratio is too high"
      description: "The 5xx ratio has exceeded 2% for 5 consecutive minutes"

  - alert: NginxQPSDrop
    expr: |
      sum(rate(nginx_access_http_response_count_total[5m])) < 10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Nginx QPS dropped abnormally"
      description: "QPS has stayed below 10 for 10 minutes"

Notes
- nginx_access_http_response_count_total comes from the log exporter; the nginx_access prefix is the namespace name set in /etc/nginxlog-exporter.yml
- rate() uses a 5m window to smooth out short-lived jitter
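The arithmetic behind the 5xx-ratio expression can be worked through by hand. The sketch below is a simplified stand-in for rate(): it divides the counter increase by the window length and ignores counter resets and the extrapolation Prometheus performs. The counter values are made-up sample data, and `simple_rate` and `five_xx_ratio` are hypothetical helper names.

```python
import re
from typing import Dict, Optional

WINDOW_SECONDS = 300.0  # the [5m] range in the rule
THRESHOLD = 0.02        # the "> 0.02" comparison in the rule


def simple_rate(start: float, end: float, window: float) -> float:
    """Per-second increase of a counter over the window (no reset handling)."""
    return max(end - start, 0.0) / window


def five_xx_ratio(start: Dict[str, float], end: Dict[str, float],
                  window: float = WINDOW_SECONDS) -> Optional[float]:
    """Share of 5xx responses among all responses, keyed by status label."""
    rates = {
        status: simple_rate(start.get(status, 0.0), value, window)
        for status, value in end.items()
    }
    total = sum(rates.values())
    if total == 0:
        return None  # PromQL would yield NaN here, which never fires the alert
    errors = sum(r for status, r in rates.items() if re.fullmatch(r"5..", status))
    return errors / total


# Made-up counter samples taken five minutes apart, keyed by status code.
START = {"200": 1000.0, "404": 50.0, "502": 10.0}
END = {"200": 3850.0, "404": 110.0, "502": 100.0}

ratio = five_xx_ratio(START, END)
print(f"5xx ratio: {ratio:.3f}, above threshold: {ratio > THRESHOLD}")
```

In this sample, 90 of 3000 responses in the window are 5xx, so the ratio is about 3% and the expression is true; the alert would still need to stay true for the whole for: 5m period before firing.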


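IV. Key Metrics for Grafana Dashboards

Suggested metrics
- QPS / status code distribution / 5xx ratio
- Connections: active / reading / writing / waiting
- Request latency P95/P99 (from the log exporter's histogram)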
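P95/P99 values read from a histogram are estimates, not exact percentiles. The sketch below mimics the usual approach of linear interpolation inside the bucket that contains the target rank, using the bucket boundaries from the exporter config; it is a simplified illustration rather than the PromQL implementation, and the cumulative counts are made-up sample data.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf: fall back to the last finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for the configured boundaries (seconds).
BUCKETS = [
    (0.01, 100), (0.05, 400), (0.1, 700), (0.3, 900),
    (0.5, 960), (1.0, 990), (2.0, 998), (5.0, 1000), (math.inf, 1000),
]

p95 = histogram_quantile(0.95, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
print(f"P95 ~ {p95:.3f}s, P99 ~ {p99:.3f}s")
```

Because the estimate can never be more precise than the bucket it falls in, it helps to place boundaries close to the latency thresholds that matter for alerting.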


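V. Troubleshooting Checklist (Commands + Expected Results)

# 1. Are the exporter ports listening?
ss -lntp | grep -E '9113|4040'

# 2. Is stub_status reachable?
curl -s http://127.0.0.1:8080/stub_status

# 3. Is the log still being written?
tail -f /var/log/nginx/access.log

# 4. Is Prometheus scraping successfully?
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {job:.labels.job,health:.health}'

# 5. Common errors
# - 403: the stub_status request was rejected by the deny rule
# - 502: wrong exporter target address
# - Metrics stay at 0: the log format does not match, or nothing is being written to the log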
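When there are many targets, it can help to filter the targets response down to the unhealthy ones. The sketch below does that for a payload shaped like the /api/v1/targets response used in step 4; `unhealthy_targets` is a hypothetical helper, and the sample payload is trimmed and made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def unhealthy_targets(payload: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (job, instance, lastError) for every active target that is not up."""
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append((
                labels.get("job", "?"),
                labels.get("instance", "?"),
                target.get("lastError", ""),
            ))
    return problems


# Trimmed, made-up response body; a real one carries more fields per target.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "nginx", "instance": "127.0.0.1:9113"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "nginxlog", "instance": "127.0.0.1:4040"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for job, instance, error in unhealthy_targets(SAMPLE):
    print(f"{job} {instance}: {error}")
```

Against a live server, the payload could be loaded with json.load(urllib.request.urlopen("http://127.0.0.1:9090/api/v1/targets")), assuming Prometheus listens on its default port as in the checklist.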

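VI. Exercises and Drills

  1. Basic drill
    - Force 5xx responses: proxy to an upstream address where nothing is listening, then verify that the alert fires.
    - Expected: NginxHigh5xxRate goes pending once the 5xx ratio exceeds 2% and fires after the condition has held for 5 minutes (for: 5m).

  2. Capacity drill
    - Run a load test for 2 minutes and watch the active/reading/writing/waiting curves.
    - Expected: connection counts rise during the test and fall back after it ends.

  3. Log normalization drill
    - Request /api/v1/123/user/456 and /api/v1/789/user/999, then check that the uri label is normalized to /api/v1/{id}/user/{id} for both.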


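Rollout recommendations
- Deploy stub_status first to monitor connections and the request overview, then add the log-based metrics
- Standardize labels (job, instance, env, app) and avoid high cardinality
- Keep metrics ports reachable only from the internal network, or put them behind an allowlist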