19.5.1 Overall Architecture for Monitoring and Observability

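The monitoring and observability architecture follows one pipeline: data collection, processing, storage, analysis and visualization, alert-driven automation, and closed-loop improvement. It covers hosts, containers, databases and middleware, and applications. Metrics, logs, and traces share one label and timestamp convention, so that availability can be visualized, performance measured, faults localized, and risks anticipated.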

Architecture Sketch

[Figure: architecture sketch; image not available in this copy]

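Unified Labels and Data Model Example

Recommended unified labels: env, cluster, service, instance, version, and team. Applying the same keys to metrics, logs, and traces is what lets data from different systems be correlated.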

# /etc/observability/labels.yaml
global_labels:
  env: prod
  cluster: k8s-prod-01
  team: platform
service_labels:
  - name: auth-api
    labels:
      service: auth
      version: v1.8.2
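
One way to keep the label convention honest is to check each service's merged label set against the required keys. The sketch below is a minimal illustration, not part of any tool: the dictionaries mirror labels.yaml by hand, and `REQUIRED_LABELS`, `merge_labels`, and `missing_labels` are hypothetical names.

```python
# Minimal sketch: check merged labels against the unified label convention.
# The data mirrors /etc/observability/labels.yaml; function names are illustrative.

REQUIRED_LABELS = ("env", "cluster", "service", "instance", "version", "team")

GLOBAL_LABELS = {"env": "prod", "cluster": "k8s-prod-01", "team": "platform"}
SERVICE_LABELS = {"auth-api": {"service": "auth", "version": "v1.8.2"}}


def merge_labels(global_labels, service_labels):
    """Return global labels overlaid with service-specific labels."""
    merged = dict(global_labels)
    merged.update(service_labels)
    return merged


def missing_labels(labels):
    """List the required labels absent from a label set, in convention order."""
    return [name for name in REQUIRED_LABELS if name not in labels]


if __name__ == "__main__":
    for name, labels in SERVICE_LABELS.items():
        merged = merge_labels(GLOBAL_LABELS, labels)
        print(name, "missing:", missing_labels(merged))
```

For the file above, only `instance` is reported as missing. That is usually acceptable, because the instance label is typically attached at scrape time (Prometheus derives it from the target address) rather than written into a static file.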

Metrics Collection and Storage (Prometheus Example)

Installation and Startup

# 1) Install Prometheus
useradd -r -s /sbin/nologin prometheus
mkdir -p /opt/prometheus /etc/prometheus
cd /opt/prometheus
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.50.0/prometheus-2.50.0.linux-amd64.tar.gz
tar -xzf prometheus-2.50.0.linux-amd64.tar.gz
cp prometheus-2.50.0.linux-amd64/{prometheus,promtool} /usr/local/bin/

# 2) Install node_exporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# 3) Run node_exporter as a systemd service
cat >/etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now node_exporter

Prometheus Configuration Explained

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s     # how often targets are scraped
  evaluation_interval: 15s # how often rules are evaluated
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100'] # node_exporter address
    relabel_configs:
      # With no source_labels, each rule sets a fixed label on every target
      - target_label: env
        replacement: prod
      - target_label: team
        replacement: platform

# Start Prometheus
cat >/etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.enable-lifecycle
[Install]
WantedBy=multi-user.target
EOF
mkdir -p /var/lib/prometheus && chown -R prometheus:prometheus /var/lib/prometheus
systemctl daemon-reload && systemctl enable --now prometheus

# Validate the configuration, then hot-reload it (needs --web.enable-lifecycle)
/usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://127.0.0.1:9090/-/reload

Expected result: http://<host>:9090/targets shows the node target as UP.
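
The same check can be scripted against the Prometheus HTTP API (`/api/v1/targets`). The sketch below assumes Prometheus listens on 127.0.0.1:9090 as configured in this section; `fetch_targets`, `summarize_targets`, and `unhealthy` are illustrative helper names, not part of Prometheus.

```python
# Minimal sketch: report scrape targets that are not healthy.
import json
import urllib.request

PROMETHEUS_URL = "http://127.0.0.1:9090"  # assumption: local Prometheus from this section


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Fetch the raw /api/v1/targets response as a dict."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Return (job, instance, health, lastError) for every active target."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        rows.append((
            labels.get("job", ""),
            labels.get("instance", ""),
            target.get("health", "unknown"),
            target.get("lastError", ""),
        ))
    return rows


def unhealthy(rows):
    """Keep only targets whose health is not 'up'."""
    return [row for row in rows if row[2] != "up"]


if __name__ == "__main__":
    bad = unhealthy(summarize_targets(fetch_targets()))
    for job, instance, health, error in bad:
        print(f"{job} {instance} is {health}: {error}")
    print("all targets up" if not bad else f"{len(bad)} target(s) not up")
```

Keeping the parsing separate from the HTTP call makes the logic easy to exercise with a saved API response when no Prometheus instance is at hand.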

Log Collection and Storage (Loki/Promtail Example)

# /etc/promtail/promtail.yaml
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: syslog
    static_configs:
      - targets: ['localhost']
        labels:
          job: syslog
          __path__: /var/log/*.log
          env: prod
          service: system

Start and verify:

mkdir -p /var/lib/promtail   # directory for the positions file
promtail -config.file=/etc/promtail/promtail.yaml &
curl -s http://localhost:9080/metrics | head
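
Once Promtail is shipping logs, the labels set in its config (job, env, service) become the stream selector used in LogQL queries. The sketch below builds a selector and a `query_range` URL for Loki's HTTP API. It assumes Loki is reachable at `http://loki:3100`, as in the client URL above; `build_selector` and `query_range_url` are hypothetical helper names.

```python
# Minimal sketch: build a LogQL stream selector and a Loki query_range URL.
from urllib.parse import urlencode

LOKI_URL = "http://loki:3100"  # assumption: same host as the Promtail client URL


def build_selector(labels):
    """Turn {'env': 'prod'} into the LogQL selector {env="prod"}."""
    parts = []
    for key in sorted(labels):
        # Escape backslashes and double quotes inside the label value.
        value = str(labels[key]).replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{key}="{value}"')
    return "{" + ", ".join(parts) + "}"


def query_range_url(selector, limit=20, base_url=LOKI_URL):
    """Return a GET URL for /loki/api/v1/query_range with the given selector."""
    params = urlencode({"query": selector, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    selector = build_selector({"env": "prod", "service": "system"})
    print(selector)
    print(query_range_url(selector))
```

The printed URL can be passed to `curl -s`. With no explicit `start` and `end` parameters, Loki falls back to its default time window for the query.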

Trace Collection and Storage (OpenTelemetry Example)

# /etc/otel/collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]

# Start the Collector
otelcol --config /etc/otel/collector.yaml
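
A pipeline may only reference receivers and exporters that are defined at the top level of the config. The sketch below illustrates that wiring rule on a dict that mirrors collector.yaml by hand. `undefined_components` is an illustrative name, and the check is far narrower than the validation the Collector itself performs at startup.

```python
# Minimal sketch: check that each pipeline only references defined components.
# CONFIG mirrors /etc/otel/collector.yaml by hand (no YAML parser needed).

CONFIG = {
    "receivers": {"otlp": {"protocols": {"grpc": None, "http": None}}},
    "exporters": {"otlp": {"endpoint": "tempo:4317", "tls": {"insecure": True}}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["otlp"]},
        }
    },
}


def undefined_components(config):
    """Return (pipeline, kind, name) for every reference with no definition."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind) or {}
            for name in pipeline.get(kind, []):
                if name not in defined:
                    problems.append((pipeline_name, kind, name))
    return problems


if __name__ == "__main__":
    print(undefined_components(CONFIG) or "pipeline wiring looks consistent")
```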

Alerting and Automation Example (Prometheus + Alertmanager)

# /etc/prometheus/alerts.yml
groups:
  - name: host-alerts
    rules:
      - alert: HostCpuHigh
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))*100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Instance {{ $labels.instance }} CPU usage > 85% (sustained for 5 minutes)"

# /etc/alertmanager/alertmanager.yml
route:
  group_by: ['alertname','instance']
  receiver: 'webhook'
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://automation/api/alert'
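
# Load the alert rules into Prometheus. This only takes effect if prometheus.yml
# lists the file under rule_files and points alerting.alertmanagers at Alertmanager.
/usr/local/bin/promtool check rules /etc/prometheus/alerts.yml
curl -X POST http://127.0.0.1:9090/-/reload

# Verify that the rule is loaded (this does not fire the alert)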
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name=="HostCpuHigh")'
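
To see what the `HostCpuHigh` expression computes, the arithmetic can be reproduced by hand: `rate()` over the idle counter gives the fraction of each second a core spent idle, `avg by(instance)` averages that across cores, and subtracting the result from 100 yields the busy percentage. The sketch below uses made-up counter deltas, and `cpu_usage_percent` is an illustrative name.

```python
# Minimal sketch of the arithmetic behind
#   100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Simplified: ignores counter resets and the extrapolation Prometheus applies.

WINDOW_SECONDS = 300  # the [5m] range in the rule
THRESHOLD = 85        # percent, from the alert expression


def cpu_usage_percent(idle_deltas, window_seconds=WINDOW_SECONDS):
    """idle_deltas: idle seconds accumulated per core over the window."""
    if not idle_deltas:
        raise ValueError("need at least one core")
    idle_rates = [delta / window_seconds for delta in idle_deltas]  # per-core rate()
    avg_idle = sum(idle_rates) / len(idle_rates)                    # avg by(instance)
    return 100 - avg_idle * 100


if __name__ == "__main__":
    # Hypothetical 4-core host: each core idle for 30 s of the 300 s window.
    usage = cpu_usage_percent([30, 30, 30, 30])
    print(f"usage={usage:.1f}% fires={usage > THRESHOLD}")
```

In this hypothetical case each core is idle 10% of the time, so usage is about 90% and the expression is true. The alert itself fires only after the condition has held for the `for: 5m` duration.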

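Troubleshooting Checklist (with Commands)

  1. A target is DOWN
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
# Check the listening port and the firewall
ss -lntp | grep 9100
  2. Abnormal timestamps or gaps in the data
# Check the system clock and NTP synchronization
timedatectl status
chronyc sources -v
  3. Inconsistent labels break aggregation
# List the values of the env label through the HTTP API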
curl -G 'http://127.0.0.1:9090/api/v1/label/env/values'
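
For the data-gap check, gaps can also be spotted from the sample timestamps themselves: with a 15 s scrape interval, consecutive samples much further apart than that point to missed scrapes. The sketch below is an illustration with made-up timestamps; `find_gaps` and its 2x tolerance are assumptions, not a Prometheus feature.

```python
# Minimal sketch: find gaps in a series of sample timestamps (Unix seconds).

SCRAPE_INTERVAL = 15  # seconds, matches scrape_interval in prometheus.yml


def find_gaps(timestamps, interval=SCRAPE_INTERVAL, tolerance=2.0):
    """Return (start, end, missed) for spans longer than tolerance * interval."""
    gaps = []
    ordered = sorted(timestamps)
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > tolerance * interval:
            missed = int(delta // interval) - 1  # scrapes that should have landed
            gaps.append((prev, curr, missed))
    return gaps


if __name__ == "__main__":
    # Hypothetical samples: regular 15 s scrapes with one 60 s hole.
    samples = [0, 15, 30, 90, 105]
    for start, end, missed in find_gaps(samples):
        print(f"gap from {start} to {end}: about {missed} missed scrape(s)")
```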

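Exercises

  1. Deploy node_exporter and have Prometheus scrape it successfully; take a screenshot of the Targets page showing UP.
  2. Write an alert rule for "disk usage > 80%" and trigger it as a test.
  3. Add a serviceversion label to log collection and filter by that label in a query.
  4. Run promtool against the configuration, capture any errors it reports, and fix them.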