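7.6.5 Monitoring and Alerting Integration (Prometheus/Exporter)
Adding Prometheus to Nginx log monitoring lets an exporter expose access metrics, error counts, and connection state as scrapeable time series, which can then drive alerting for early detection and fast response. This section follows two approaches, "Nginx Stub Status + nginx-prometheus-exporter" and "access log parsing + nginxlog-exporter", and covers installation, configuration, troubleshooting, and drills.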
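Data flow (schematic): Nginx (stub_status endpoint, access log) → exporter (nginx-prometheus-exporter on :9113, nginxlog-exporter on :4040) → Prometheus scrape → alert rules → Alertmanager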
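Options and when to use them
- Stub Status + nginx-prometheus-exporter: lightweight overview of connections and requests
- access log parsing + nginxlog-exporter: business metrics and per-path statistics
- Nginx Plus API (if available): finer-grained metrics (not part of the open-source edition)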
I. Stub Status + nginx-prometheus-exporter
1) Enable Nginx stub_status
# 1. Enable stub_status (Nginx configuration snippet)
cat >/etc/nginx/conf.d/stub_status.conf <<'EOF'
server {
    listen 127.0.0.1:8080;
    location /stub_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}
EOF
# 2. Check the syntax and reload
nginx -t && systemctl reload nginx
# 3. Expected output
curl -s http://127.0.0.1:8080/stub_status
# Active connections: 1
# server accepts handled requests
# 10 10 20
# Reading: 0 Writing: 1 Waiting: 0
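Key configuration notes
- allow/deny: restricts who can read the metrics so they are not exposed to the public internet
- listen 127.0.0.1: the endpoint is reachable from the local host only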
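The stub_status counters are easier to reason about once they are split into named values. The following is a minimal sketch, not part of Nginx itself: `parse_stub_status` is a hypothetical helper name, and the sample text is the expected output shown above. It also derives `dropped = accepts - handled`, which stays at 0 unless Nginx had to drop connections (for example when a resource limit is reached).

```shell
#!/usr/bin/env bash
# Sketch: turn stub_status text (read from stdin) into key=value pairs.
# parse_stub_status is a hypothetical helper name used for illustration.
parse_stub_status() {
  awk '
    /^Active connections:/ { active = $3 }
    # The counters line is the only line made of exactly three integers.
    /^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+[0-9]+[ \t]*$/ {
      accepts = $1; handled = $2; requests = $3
    }
    END {
      printf "active=%d accepts=%d handled=%d requests=%d dropped=%d\n",
             active, accepts, handled, requests, accepts - handled
    }'
}

# Sample input copied from the expected output above.
sample='Active connections: 1
server accepts handled requests
 10 10 20
Reading: 0 Writing: 1 Waiting: 0'

printf '%s\n' "$sample" | parse_stub_status
# prints: active=1 accepts=10 handled=10 requests=20 dropped=0
```

Against a live instance the same helper can be fed with `curl -s http://127.0.0.1:8080/stub_status | parse_stub_status`.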
2) Install nginx-prometheus-exporter
# Install from the release binary (replace the example version as needed)
useradd -r -s /sbin/nologin nginxexp
cd /usr/local/src
curl -LO https://github.com/nginxinc/nginx-prometheus-exporter/releases/download/v1.1.0/nginx-prometheus-exporter_1.1.0_linux_amd64.tar.gz
tar -xf nginx-prometheus-exporter_1.1.0_linux_amd64.tar.gz
install -m 755 nginx-prometheus-exporter /usr/local/bin/
# systemd service
cat >/etc/systemd/system/nginx-exporter.service <<'EOF'
[Unit]
Description=NGINX Prometheus Exporter
After=network.target
[Service]
User=nginxexp
ExecStart=/usr/local/bin/nginx-prometheus-exporter \
    --nginx.scrape-uri=http://127.0.0.1:8080/stub_status
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now nginx-exporter
# Expected output: Prometheus text-format metrics
curl -s http://127.0.0.1:9113/metrics | head
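A listening exporter does not by itself prove that it can reach stub_status; nginx-prometheus-exporter reports that through the `nginx_up` gauge. The sketch below checks that gauge in metrics text read from stdin. `check_nginx_up` is a hypothetical helper name, the sample metrics are illustrative rather than captured output, and the check assumes the exporter runs without constant labels (so the metric line starts with the bare name `nginx_up`).

```shell
#!/usr/bin/env bash
# Sketch: report whether the exporter's last scrape of stub_status succeeded.
# check_nginx_up is a hypothetical helper; it reads Prometheus text format on stdin.
check_nginx_up() {
  awk '
    # Comment lines start with "#", so only the sample line matches here.
    $1 == "nginx_up" { found = 1; up = $2 }
    END {
      if (!found)       { print "nginx_up missing"; exit 2 }
      else if (up == 1) { print "nginx_up ok" }
      else              { print "nginx_up down"; exit 1 }
    }'
}

# Illustrative sample, not captured from a real exporter.
sample_metrics='# TYPE nginx_up gauge
nginx_up 1
nginx_connections_active 1'

printf '%s\n' "$sample_metrics" | check_nginx_up
# prints: nginx_up ok
```

With the service running, `curl -s http://127.0.0.1:9113/metrics | check_nginx_up` gives a quick pass/fail signal; a non-zero exit status makes it usable in scripts.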
3) Prometheus scrape configuration
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['127.0.0.1:9113']
        labels:
          env: prod
          app: nginx
Field notes
- job_name: the Prometheus job name, used for aggregation and alerting
- targets: the exporter address
- labels: labels used for grouping and for telling environments apart
II. access log parsing + nginxlog-exporter
1) Nginx log format (including upstream time and status)
# /etc/nginx/nginx.conf
log_format main_json escape=json
'{"time":"$time_iso8601","remote_addr":"$remote_addr","host":"$host",'
'"request":"$request","status":$status,"body_bytes_sent":$body_bytes_sent,'
'"request_time":$request_time,"upstream_time":"$upstream_response_time",'
'"upstream_status":"$upstream_status","uri":"$uri"}';
access_log /var/log/nginx/access.log main_json;
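Key fields
- request_time: total time spent handling the request
- upstream_response_time: time spent in the backend, useful for locating slow requests (logged above under the JSON key upstream_time)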
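Before any exporter is involved, the `request_time` field can already be used to pick slow requests straight out of the log. The sketch below assumes the flat, one-object-per-line JSON produced by `main_json`; `slow_requests` is a hypothetical helper name, the sample lines are made up for illustration, and the pattern match is not a general JSON parser.

```shell
#!/usr/bin/env bash
# Sketch: print log lines whose request_time exceeds a threshold in seconds.
# slow_requests is a hypothetical helper; it only understands the flat
# main_json layout, where request_time is written as a bare number.
slow_requests() {
  awk -v limit="$1" '
    match($0, /"request_time":[0-9.]+/) {
      # Skip the 15-character key prefix ("request_time":) to get the number.
      t = substr($0, RSTART + 15, RLENGTH - 15) + 0
      if (t > limit) print
    }'
}

# Made-up sample lines in the main_json layout (shortened).
sample_log='{"time":"2024-01-01T00:00:00+00:00","status":200,"request_time":0.012,"upstream_time":"0.010","uri":"/fast"}
{"time":"2024-01-01T00:00:01+00:00","status":200,"request_time":1.250,"upstream_time":"1.240","uri":"/slow"}'

# Only the /slow line exceeds 0.5 s and is printed.
printf '%s\n' "$sample_log" | slow_requests 0.5
```

On a real host the input would be the log file itself, for example `slow_requests 0.5 </var/log/nginx/access.log`.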
2) Install nginxlog-exporter
useradd -r -s /sbin/nologin nginxlogexp
cd /usr/local/src
curl -LO https://github.com/martin-helmich/prometheus-nginxlog-exporter/releases/download/v1.10.0/prometheus-nginxlog-exporter_1.10.0_linux_amd64.tar.gz
tar -xf prometheus-nginxlog-exporter_1.10.0_linux_amd64.tar.gz
install -m 755 prometheus-nginxlog-exporter /usr/local/bin/
# Configuration file
cat >/etc/nginxlog-exporter.yml <<'EOF'
listen:
  port: 4040
  address: "0.0.0.0"
namespaces:
  - name: "nginx_access"
    parser: "json"
    source:
      files:
        - /var/log/nginx/access.log
    labels:
      app: "nginx"
    histogram_buckets: [0.01,0.05,0.1,0.3,0.5,1,2,5]
    relabel_configs:
      - target_label: "uri"
        from: "uri"
        matches:
          - regexp: "/api/v1/\\d+/user/\\d+"
            replacement: "/api/v1/{id}/user/{id}"
EOF
# systemd service
cat >/etc/systemd/system/nginxlog-exporter.service <<'EOF'
[Unit]
Description=Nginx Log Exporter
After=network.target
[Service]
User=nginxlogexp
ExecStart=/usr/local/bin/prometheus-nginxlog-exporter \
-config-file /etc/nginxlog-exporter.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now nginxlog-exporter
# Expected output: metrics prefixed with nginx_access
curl -s http://127.0.0.1:4040/metrics | grep nginx_access | head
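Parameter notes
- JSON parsing: selected with parser: json in prometheus-nginxlog-exporter; the exporter's format key holds a text log-format string, not a parser name
- histogram_buckets: bucket boundaries (in seconds) for the request-time histogram
- URI relabeling: configured under relabel_configs in the exporter's schema; normalizes URIs so that per-ID paths do not create high-cardinality labels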
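The effect of the URI normalization can be previewed offline before touching the exporter. The sketch below applies the same pattern with `sed`; `normalize_uri` is a hypothetical helper name, and `[0-9]+` stands in for `\d+` because POSIX extended regular expressions have no `\d`. It imitates the rewrite and is not the exporter's own code path.

```shell
#!/usr/bin/env bash
# Sketch: collapse numeric path segments the way the relabel rule is meant to.
# normalize_uri is a hypothetical helper; it reads one URI per line on stdin.
normalize_uri() {
  sed -E 's#/api/v1/[0-9]+/user/[0-9]+#/api/v1/{id}/user/{id}#'
}

# Two per-ID URIs collapse into one label value; unrelated paths pass through.
printf '%s\n' /api/v1/123/user/456 /api/v1/789/user/999 /healthz \
  | normalize_uri | sort -u
# prints:
# /api/v1/{id}/user/{id}
# /healthz
```

Three distinct input paths yield two distinct label values, which is the cardinality reduction the rule is there for.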
3) Prometheus scrape configuration
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'nginxlog'
    static_configs:
      - targets: ['127.0.0.1:4040']
        labels:
          env: prod
          app: nginx
III. Example alert rules (prerequisite for Alertmanager integration)
# /etc/prometheus/rules/nginx.yml
groups:
  - name: nginx.rules
    rules:
      - alert: NginxHigh5xxRate
        expr: |
          sum(rate(nginx_access_http_response_count_total{status=~"5.."}[5m]))
          /
          sum(rate(nginx_access_http_response_count_total[5m])) > 0.02
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nginx 5xx ratio too high"
          description: "5xx ratio above 2% for 5 consecutive minutes"
      - alert: NginxQPSDrop
        expr: |
          sum(rate(nginx_access_http_response_count_total[5m])) < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nginx QPS dropped abnormally"
          description: "QPS below 10 for 10 minutes"
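Notes
- nginx_access_http_response_count_total comes from the log exporter; its prefix is the namespace name (nginx_access) set in the exporter configuration
- rate() uses a 5m window to smooth out short-lived jitter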
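The 5xx-ratio rule divides one rate by another over the same window, so the window length cancels out and the result equals the 5xx counter increase divided by the total counter increase. The sketch below works through that arithmetic with made-up numbers; `ratio_5xx` is a hypothetical helper name and 0.02 is the threshold from the rule.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the alert condition from two counter increases.
# Usage: ratio_5xx ERR_INCREASE TOTAL_INCREASE [THRESHOLD]
# ratio_5xx is a hypothetical helper used for illustration only.
ratio_5xx() {
  awk -v e="$1" -v t="$2" -v th="${3:-0.02}" 'BEGIN {
    # With no traffic the ratio is undefined, so report that instead.
    if (t <= 0) { print "no traffic"; exit }
    r = e / t
    printf "ratio=%.4f %s\n", r, (r > th ? "ALERT" : "ok")
  }'
}

# Made-up increases over one 5-minute window.
ratio_5xx 45 1500   # 45/1500 = 0.03  -> prints: ratio=0.0300 ALERT
ratio_5xx 12 1500   # 12/1500 = 0.008 -> prints: ratio=0.0080 ok
```

Prometheus additionally requires the condition to hold for the whole `for: 5m` period before the alert fires, which this one-shot calculation does not model.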
IV. Key metrics for Grafana panels
Suggested metrics
- QPS / status-code distribution / 5xx ratio
- active/reading/writing/waiting connections
- Request time P95/P99 (from the log exporter's histogram)
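P95/P99 panels are usually built with `histogram_quantile` over the bucket series. The sketch below only assembles the PromQL text so it can be pasted into a panel or sent to the query API. `quantile_query` is a hypothetical helper name, and the metric name `nginx_access_http_response_time_seconds_hist_bucket` is an assumption based on the `nginx_access` namespace; confirm it against the exporter's /metrics output before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: build a PromQL expression for a request-time quantile.
# Usage: quantile_query QUANTILE [WINDOW]
# Assumption: the histogram is exposed as
# nginx_access_http_response_time_seconds_hist_bucket (check /metrics).
quantile_query() {
  printf 'histogram_quantile(%s, sum by (le) (rate(nginx_access_http_response_time_seconds_hist_bucket[%s])))\n' \
    "$1" "${2:-5m}"
}

quantile_query 0.95
# prints: histogram_quantile(0.95, sum by (le) (rate(nginx_access_http_response_time_seconds_hist_bucket[5m])))
quantile_query 0.99 10m
# prints: histogram_quantile(0.99, sum by (le) (rate(nginx_access_http_response_time_seconds_hist_bucket[10m])))
```

To evaluate the expression outside Grafana, it can be passed to the Prometheus HTTP API, for example `curl -sG http://127.0.0.1:9090/api/v1/query --data-urlencode "query=$(quantile_query 0.95)"`.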
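V. Troubleshooting checklist (commands + expected results)
# 1. Are the exporter ports listening?
ss -lntp | grep -E '9113|4040'
# 2. Is stub_status reachable?
curl -s http://127.0.0.1:8080/stub_status
# 3. Is the log still being written?
tail -f /var/log/nginx/access.log
# 4. Is Prometheus scraping successfully?
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {job:.labels.job,health:.health}'
# 5. Common errors
# - 403: stub_status is blocked by the deny rule
# - 502: the exporter target address is wrong
# - Metrics stay at 0: the log format does not match, nothing is being written, or the exporter user cannot read the log file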
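VI. Exercises and drills
- Basic drill
  - Produce 5xx responses on purpose: point a location at a nonexistent upstream and verify that the alert triggers.
  - Expected: NginxHigh5xxRate goes pending once the 5xx ratio exceeds 2% and fires after the condition has held for 5 minutes (the rule's for: 5m).
- Capacity drill
  - Run a load test for 2 minutes and watch the active/reading/writing/waiting curves.
  - Expected: the connection counts rise during the test and fall back after it ends.
- Log normalization drill
  - Request /api/v1/123/user/456 and /api/v1/789/user/999, then check that the uri label is normalized to /api/v1/{id}/user/{id}.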
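Rollout recommendations
- Deploy stub_status first for the connection and request overview, then add log-based metrics
- Standardize labels (job, instance, env, app) and avoid high-cardinality values
- Keep metrics ports reachable from the internal network only, or restrict them with an allowlist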