19.5.1 Overall Architecture for Monitoring and Observability
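The overall monitoring and observability architecture is organized around one pipeline: data collection, processing, storage, analysis and visualization, alert integration, and closed-loop improvement. It covers hosts, containers, databases and middleware, and applications. Metrics, logs, and traces share one label and timestamp convention, so that availability is visible, performance is measurable, problems can be localized, and risks can be anticipated.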
Architecture sketch
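Unified labels and data model example
Recommended unified labels: env, cluster, service, instance, version, team. Using the same keys in every system lets metrics, logs, and traces be correlated with one another.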
# /etc/observability/labels.yaml
global_labels:
  env: prod
  cluster: k8s-prod-01
  team: platform
service_labels:
  - name: auth-api
    labels:
      service: auth
      version: v1.8.2
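As a minimal sketch of how a label file like the one above could be applied, the following Python merges the global labels into each service's labels and reports which of the recommended keys are still missing. The names `merge_labels` and `missing_labels` are hypothetical, and the configuration is inlined as a dict instead of being parsed from YAML so the example has no dependencies. It assumes `instance` is attached later, at scrape time.

```python
# Recommended unified label keys from the text above.
REQUIRED_LABELS = ("env", "cluster", "service", "instance", "version", "team")

# Inlined copy of /etc/observability/labels.yaml (a real tool would parse the file).
CONFIG = {
    "global_labels": {"env": "prod", "cluster": "k8s-prod-01", "team": "platform"},
    "service_labels": [
        {"name": "auth-api", "labels": {"service": "auth", "version": "v1.8.2"}},
    ],
}


def merge_labels(config):
    """Return {service name: labels}, with service labels overriding global ones."""
    merged = {}
    for entry in config["service_labels"]:
        labels = dict(config["global_labels"])
        labels.update(entry["labels"])
        merged[entry["name"]] = labels
    return merged


def missing_labels(labels):
    """List the recommended keys that are absent from a label set."""
    return [key for key in REQUIRED_LABELS if key not in labels]


if __name__ == "__main__":
    for name, labels in merge_labels(CONFIG).items():
        print(name, labels, "missing:", missing_labels(labels))
```

For `auth-api`, only `instance` is reported as missing, which is expected because that label is normally set per scrape target rather than in a static file.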
Metrics collection and storage (Prometheus example)
Installation and startup
# 1) Install Prometheus
useradd -r -s /sbin/nologin prometheus
mkdir -p /opt/prometheus /etc/prometheus
cd /opt/prometheus
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar -xzf prometheus-2.50.0.linux-amd64.tar.gz
cp prometheus-2.50.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
# 2) Install node_exporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# 3) Start node_exporter as a systemd service
cat >/etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now node_exporter
Prometheus configuration explained
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often rules are evaluated
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100']   # node_exporter address
    relabel_configs:
      - target_label: env
        replacement: prod
      - target_label: team
        replacement: platform
# Start Prometheus as a systemd service
cat >/etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
EOF
mkdir -p /var/lib/prometheus && chown -R prometheus:prometheus /var/lib/prometheus
systemctl daemon-reload && systemctl enable --now prometheus
# Validate the configuration, then hot-reload it
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://127.0.0.1:9090/-/reload
Expected result: http://<host>:9090/targets shows the node target as UP.
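The same check can be scripted instead of read off the web page. The sketch below lists every active target that is not healthy, based on the JSON returned by `/api/v1/targets`. The helper names `unhealthy_targets` and `fetch_targets` are hypothetical, and the base URL assumes the local Prometheus started above.

```python
import json
import urllib.request

# Assumed address; matches the local Prometheus started above.
PROMETHEUS_URL = "http://127.0.0.1:9090"


def unhealthy_targets(payload):
    """Return (job, instance, lastError) for every active target that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            labels = target["labels"]
            problems.append(
                (labels.get("job"), labels.get("instance"), target.get("lastError", ""))
            )
    return problems


def fetch_targets(base_url=PROMETHEUS_URL):
    """Fetch /api/v1/targets from a running Prometheus and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

Against a live server, `unhealthy_targets(fetch_targets())` should return an empty list once the node target is UP.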
Log collection and storage (Loki/Promtail example)
# /etc/promtail/promtail.yaml
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: syslog
    static_configs:
      - targets: ['localhost']
        labels:
          job: syslog
          __path__: /var/log/*.log
          env: prod
          service: system
Start and verify:
promtail -config.file=/etc/promtail/promtail.yaml &
curl -s http://localhost:9080/metrics | head
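To make the shared label and timestamp convention concrete for logs, here is a sketch of the JSON form of a Loki push request: one stream identified by its labels, carrying `[timestamp, line]` pairs with the timestamp given as a string of Unix nanoseconds. The helpers `build_push_payload` and `push` are hypothetical, and the URL simply reuses the endpoint from the Promtail configuration. This illustrates the data model only; it is not a description of how Promtail encodes its requests.

```python
import json
import time
import urllib.request

# Same push endpoint as in promtail.yaml; the host name "loki" comes from that file.
LOKI_PUSH_URL = "http://loki:3100/loki/api/v1/push"


def build_push_payload(labels, line, ts_ns=None):
    """Build a Loki push body: one stream with one [timestamp, line] entry."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Loki expects the timestamp as a string of Unix nanoseconds.
    return {"streams": [{"stream": dict(labels), "values": [[str(ts_ns), line]]}]}


def push(payload, url=LOKI_PUSH_URL):
    """POST the payload to Loki and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

A call such as `build_push_payload({"job": "syslog", "env": "prod", "service": "system"}, "disk check ok")` produces a stream carrying the same labels that the Promtail job attaches.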
Trace collection and storage (OpenTelemetry example)
# /etc/otel/collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
# Start the Collector
otelcol --config /etc/otel/collector.yaml
Alerting and integration example (Prometheus + Alertmanager)
# /etc/prometheus/alerts.yml
groups:
  - name: host-alerts
    rules:
      - alert: HostCpuHigh
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))*100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage > 85% (sustained for 5 minutes)"
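The arithmetic behind the `HostCpuHigh` expression can be followed with plain numbers. The sketch below is a simplified model: it computes the per-second increase of each core's idle counter over a window, averages across cores, and converts the idle fraction into a usage percentage. It does not reproduce the extrapolation that PromQL's `rate()` performs, and the function names are hypothetical.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Mirror 100 - avg(rate(idle[window])) * 100 for per-core idle counters.

    idle_start / idle_end hold the idle seconds of each core at the start
    and at the end of the window.
    """
    # rate(): per-second increase of each idle counter over the window.
    idle_rates = [
        (end - start) / window_seconds for start, end in zip(idle_start, idle_end)
    ]
    # avg by(instance): mean idle fraction across all cores of one instance.
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


def should_alert(usage_percent, threshold=85):
    """The rule's comparison only; the 5m `for:` clause is not modeled here."""
    return usage_percent > threshold


if __name__ == "__main__":
    # Two cores, each idle for 30 of the last 300 seconds: roughly 90% busy.
    usage = cpu_usage_percent([1000.0, 2000.0], [1030.0, 2030.0], 300)
    print(round(usage, 1), should_alert(usage))
```

In the rule itself the condition must additionally hold for the full `for: 5m` period before the alert moves from pending to firing.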
# /etc/alertmanager/alertmanager.yml
route:
  group_by: ['alertname','instance']
  receiver: 'webhook'
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://automation/api/alert'
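On the receiving side, the webhook endpoint gets a JSON document with a top-level `alerts` array, where each entry carries `status`, `labels`, and `annotations`. The sketch below shows how an automation service might pull out the firing alerts. The function name `firing_alerts` and the sample payload are hypothetical and hand-written, and the HTTP handler that would serve `/api/alert` is left out.

```python
def firing_alerts(payload):
    """Return one summary dict per firing alert in an Alertmanager webhook body."""
    summaries = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications are skipped in this sketch
        labels = alert.get("labels", {})
        summaries.append({
            "alertname": labels.get("alertname"),
            "instance": labels.get("instance"),
            "severity": labels.get("severity"),
            "summary": alert.get("annotations", {}).get("summary", ""),
        })
    return summaries


# Hand-written sample in the shape of Alertmanager's webhook JSON.
SAMPLE = {
    "version": "4",
    "status": "firing",
    "receiver": "webhook",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HostCpuHigh", "instance": "127.0.0.1:9100", "severity": "warning"},
            "annotations": {"summary": "High CPU usage"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "HostCpuHigh", "instance": "10.0.0.2:9100", "severity": "warning"},
            "annotations": {"summary": "High CPU usage"},
        },
    ],
}

if __name__ == "__main__":
    for item in firing_alerts(SAMPLE):
        print(item)
```

Because the route groups by `alertname` and `instance`, each notification in this setup would normally carry the alerts of a single instance.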
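# Reload Prometheus so it picks up the alert rules
# (prometheus.yml must list /etc/prometheus/alerts.yml under rule_files)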
curl -X POST http://127.0.0.1:9090/-/reload
# Verify that the HostCpuHigh rule has been loaded
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name=="HostCpuHigh")'
Troubleshooting checklist (with commands)
- Target status is DOWN
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
# Check the listening port and the firewall
ss -lntp | grep 9100
- Timestamp anomalies or gaps in the data
# Check the system clock and NTP synchronization
timedatectl status
chronyc sources -v
- Inconsistent labels break aggregation
# List the values of the env label through the HTTP API
curl -G 'http://127.0.0.1:9090/api/v1/label/env/values'
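One way to spot inconsistent labels in that output is to group values that differ only by case or surrounding whitespace. The sketch below takes the `data` array of a label-values response and returns the groups that have more than one spelling. The function name `find_label_variants` is hypothetical, and the normalization rule is only an assumption about what counts as "the same" value.

```python
from collections import defaultdict


def find_label_variants(values):
    """Group label values that differ only by case or surrounding whitespace."""
    groups = defaultdict(list)
    for value in values:
        groups[value.strip().lower()].append(value)
    # Keep only the normalized keys that map to more than one distinct spelling.
    return {
        key: sorted(set(spellings))
        for key, spellings in groups.items()
        if len(set(spellings)) > 1
    }


if __name__ == "__main__":
    # Example input shaped like the "data" field of /api/v1/label/env/values.
    print(find_label_variants(["prod", "Prod", "staging"]))
```

Any group returned here points at series that an aggregation such as `sum by(env)` would split into separate results.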
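Exercises
- Deploy node_exporter and have Prometheus scrape it successfully; take a screenshot showing the target as UP.
- Write an alert rule for "disk usage > 80%" and trigger it as a test.
- Add service and version labels to the log collection and filter by those labels in a query.
- Check the configuration with promtool, capture the errors it reports, and fix them.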