19.11.9 Case Study: Multi-Cloud and Hybrid-Cloud Operations Platform

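In multi-cloud and hybrid-cloud environments, the operations platform must manage private clouds, public clouds, and on-premises IDCs through a single interface. The main problems to solve are resource fragmentation, network interconnection, consistent identity and permissions, unified monitoring and alerting, and cost and compliance governance. This case study follows a platform approach of "unified control plane + unified data plane + unified process" and results in a reusable multi-cloud operations framework.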

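Principles and Architecture Sketch

Core Goals and Scope

Unified resource view: bring compute, network, storage, and Kubernetes cluster resources from public and private clouds under management
Unified monitoring and alerting: build cross-cloud monitoring with Prometheus federation or remote write
Unified configuration and release: standardized configuration baselines plus image and artifact management
Unified permissions and auditing: integrate with a single identity source and permission policy
Cost and SLA governance: multi-cloud cost visibility, SLA/SLO measurement, and optimization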

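Platform Architecture and Key Components

Resource onboarding layer: adapters for cloud vendor APIs and private cloud APIs, resource synchronization, and change subscription
Asset and topology layer: unified CMDB model and business-application-resource topology mapping
Operations tooling layer: unified task orchestration, automated execution, and gray (canary) release
Monitoring and logging layer: multi-cluster Prometheus, unified alerting, and log search
Security and audit layer: SSO, RBAC, operation auditing, and compliance checks
Operations and cost layer: cost allocation, budget alerts, and resource optimization recommendations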
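The resource onboarding layer is easiest to reason about as a set of provider adapters behind one contract. The following is a minimal sketch under that assumption; `Resource`, `CloudAdapter`, `StaticAdapter`, and `sync_all` are hypothetical names chosen for illustration, and the in-memory adapter stands in for real cloud API calls.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Resource:
    """Normalized record shared by every adapter (hypothetical shape)."""
    resource_id: str
    cloud_type: str
    region: str


class CloudAdapter(Protocol):
    """Contract each provider-specific adapter is assumed to implement."""
    cloud_type: str

    def list_resources(self) -> Iterable[Resource]: ...


class StaticAdapter:
    """Stand-in adapter backed by in-memory rows instead of a real cloud API."""

    def __init__(self, cloud_type: str, rows: list[tuple[str, str]]):
        self.cloud_type = cloud_type
        self._rows = rows

    def list_resources(self) -> Iterable[Resource]:
        for resource_id, region in self._rows:
            yield Resource(resource_id, self.cloud_type, region)


def sync_all(adapters: Iterable[CloudAdapter]) -> dict[str, Resource]:
    """Merge all adapters into one inventory keyed by "cloud_type:resource_id"."""
    inventory: dict[str, Resource] = {}
    for adapter in adapters:
        for res in adapter.list_resources():
            inventory[f"{res.cloud_type}:{res.resource_id}"] = res
    return inventory


if __name__ == "__main__":
    demo = [
        StaticAdapter("aws", [("i-0001", "ap-southeast-1")]),
        StaticAdapter("aliyun", [("i-bp0001", "cn-hangzhou")]),
    ]
    for key in sorted(sync_all(demo)):
        print(key)
```

Keying the inventory by cloud type plus ID avoids collisions when two providers happen to issue the same identifier; a real adapter would replace `StaticAdapter` with calls to the vendor SDK or CLI.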

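Typical Implementation Steps and Examples

1) Resource Inventory and Network Connectivity

Example: inventory resources on AWS, Alibaba Cloud, and a private OpenStack cloud, and emit them in a unified format

# 1) AWS resource inventory (requires awscli installed and configured)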
aws --version
aws configure set region ap-southeast-1
aws ec2 describe-instances --query 'Reservations[].Instances[].{ID:InstanceId,AZ:Placement.AvailabilityZone,Type:InstanceType,State:State.Name}' --output table

# 2) Alibaba Cloud ECS inventory (requires aliyun-cli installed and an AccessKey configured)
aliyun --version
aliyun ecs DescribeInstances --RegionId cn-hangzhou --PageSize 10 | jq '.Instances.Instance[] | {id:.InstanceId, zone:.ZoneId, type:.InstanceType, status:.Status}'

# 3) OpenStack inventory (requires sourcing the OpenStack credentials)
source /etc/kolla/admin-openrc.sh
openstack server list -f json | jq '.[] | {id:.ID, name:.Name, status:.Status}'

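Expected result: the output from all three clouds is reduced to a small, comparable set of fields, ready to be normalized and imported into the CMDB.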
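The three commands still emit different key names and status vocabularies, so a small normalization step is needed before import. The sketch below assumes the input dictionaries have exactly the shapes produced by the `--query` and `jq` projections above; the function name `normalize` and the `STATUS_MAP` vocabulary are illustrative assumptions, not part of any CMDB product.

```python
# Assumed common status vocabulary; unknown values pass through lowercased.
STATUS_MAP = {
    "running": "running",   # AWS "running", Alibaba Cloud "Running"
    "active": "running",    # OpenStack "ACTIVE"
    "stopped": "stopped",   # AWS "stopped", Alibaba Cloud "Stopped"
    "shutoff": "stopped",   # OpenStack "SHUTOFF"
}


def _status(raw: str) -> str:
    key = raw.lower()
    return STATUS_MAP.get(key, key)


def normalize(cloud_type: str, item: dict) -> dict:
    """Map one provider-specific record onto the shared field names."""
    if cloud_type == "aws":
        # Keys come from the JMESPath projection: ID, AZ, Type, State.
        rid, zone, itype, status = item["ID"], item["AZ"], item["Type"], item["State"]
    elif cloud_type == "aliyun":
        # Keys come from the jq projection: id, zone, type, status.
        rid, zone, itype, status = item["id"], item["zone"], item["type"], item["status"]
    elif cloud_type == "openstack":
        # The jq projection for OpenStack carries no zone or flavor.
        rid, zone, itype, status = item["id"], None, None, item["status"]
    else:
        raise ValueError(f"unsupported cloud_type: {cloud_type}")
    return {
        "resource_id": rid,
        "cloud_type": cloud_type,
        "zone": zone,
        "type": itype,
        "status": _status(status),
    }


if __name__ == "__main__":
    print(normalize("openstack", {"id": "abc", "name": "vm1", "status": "ACTIVE"}))
```

Missing fields are kept as `None` rather than guessed, so gaps in the OpenStack projection stay visible in the CMDB instead of being silently filled.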

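Network connectivity check example

# Probe a private address in the cloud VPC from the on-premises IDC
ping -c 3 10.10.10.10

# Test TCP connectivity (for example, the Kubernetes API or a database port)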
nc -vz 10.10.10.10 6443
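When many endpoints need checking, the same TCP probe can be scripted. This is a minimal sketch using only the standard library; `tcp_reachable` and `probe_all` are hypothetical helper names, and the check only confirms that a TCP handshake succeeds, not that the service behind the port is healthy.

```python
import socket
from typing import Callable, Iterable


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def probe_all(
    targets: Iterable[tuple[str, int]],
    check: Callable[[str, int], bool] = tcp_reachable,
) -> dict[str, bool]:
    """Probe every (host, port) pair and report the result per endpoint."""
    return {f"{host}:{port}": check(host, port) for host, port in targets}
```

A call such as `probe_all([("10.10.10.10", 6443), ("10.10.10.10", 3306)])` would return one boolean per endpoint; the `check` parameter exists so the probe can be swapped out in tests.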

2) Unified Asset Model and Tagging Policy

Example: unified CMDB fields (YAML)

# /opt/cmdb/models/resource.yaml
resource_id: "i-xxxxxxxx"
cloud_type: "aws"        # one of: aws | aliyun | openstack | idc
region: "ap-southeast-1"
zone: "ap-southeast-1a"
env: "prod"              # one of: prod | staging | dev
app: "order-service"
owner: "team-a"
cost_center: "CC1001"
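A field list like this is only useful if records are checked against it before they reach the CMDB. The following sketch shows one way to do that; `REQUIRED_FIELDS`, `ALLOWED_VALUES`, and `validate_resource` are hypothetical names, and the allowed values are taken from the model above.

```python
REQUIRED_FIELDS = (
    "resource_id", "cloud_type", "region", "zone",
    "env", "app", "owner", "cost_center",
)

# Enumerated fields and their permitted values, as listed in the model.
ALLOWED_VALUES = {
    "cloud_type": {"aws", "aliyun", "openstack", "idc"},
    "env": {"prod", "staging", "dev"},
}


def validate_resource(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing field: {field}")
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field)
        if value and value not in allowed:
            errors.append(f"invalid {field}: {value!r}")
    return errors


if __name__ == "__main__":
    sample = {"resource_id": "i-xxxxxxxx", "cloud_type": "aws", "env": "qa"}
    for problem in validate_resource(sample):
        print(problem)
```

Returning every problem at once, rather than raising on the first, makes it practical to report a full list of tagging gaps back to the owning team.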

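Resource import script example (Python sketch)

# /opt/cmdb/scripts/import_resources.py
# Input: the AWS inventory query from step 1, saved with `--output json`
# instead of `table` (a list of objects with ID, AZ, Type, and State keys).
import json

import requests

CMDB_URL = "https://cmdb.example.com/api/v1/resources"

def upsert(resource):
    """POST one record to the CMDB; raise on HTTP errors instead of continuing silently."""
    r = requests.post(CMDB_URL, json=resource, timeout=10)
    r.raise_for_status()
    print(r.status_code, resource["resource_id"])

with open("/opt/cmdb/data/aws_instances.json") as f:
    data = json.load(f)

for ins in data:
    resource = {
        "resource_id": ins["ID"],
        "cloud_type": "aws",
        "region": "ap-southeast-1",  # region configured for the inventory query
        "zone": ins["AZ"],
        # Placeholder ownership values; a real import would read them from tags.
        "env": "prod",
        "app": "order-service",
        "owner": "team-a",
        "cost_center": "CC1001",
    }
    upsert(resource)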

3) Unified Monitoring Metrics (Prometheus Federation / Remote Write)

Installation and configuration example (Prometheus federation)

# 1) Install Prometheus (Linux example)
useradd -M -s /sbin/nologin prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar -xf prometheus-2.48.0.linux-amd64.tar.gz -C /opt/
ln -s /opt/prometheus-2.48.0.linux-amd64 /opt/prometheus

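# 2) Configure federation on the aggregation server
cat >/opt/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 30s

# Load alert rules; without this entry the rule files are never evaluated
rule_files:
  - /opt/prometheus/rules/*.yml
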
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"k8s|node|mysql"}'
    static_configs:
      - targets:
          - 'prometheus-aws.example.com:9090'
          - 'prometheus-aliyun.example.com:9090'
          - 'prometheus-private.example.com:9090'
EOF

# 3) Start Prometheus
/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.listen-address=0.0.0.0:9090

Expected result: queries on the aggregation server return metrics from all three clouds.
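Each scrape of a federation target is an HTTP GET on `/federate` with one `match[]` parameter per selector, and the selectors must be URL-encoded. The sketch below builds such a URL with the standard library so the encoding can be inspected; `federate_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def federate_url(base: str, selectors: list[str]) -> str:
    """Build a /federate URL with one URL-encoded match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    print(federate_url(
        "http://prometheus-aws.example.com:9090",
        ['{job=~"k8s|node|mysql"}'],
    ))
```

Passing the parameters as a list of pairs, rather than a dictionary, is what allows the `match[]` key to repeat when several selectors are needed.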

Unified alert rule example

# /opt/prometheus/rules/multi-cloud.rules.yml
groups:
- name: multi-cloud
  rules:
  - alert: HighCPUUsage
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage on instance"
      description: "CPU usage on instance {{ $labels.instance }} has exceeded 80%"
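CPU utilization is commonly derived from node_exporter's idle counter: the per-core idle rate is the counter increase divided by the window length, and utilization is 1 minus the average idle rate across cores. The sketch below works through that arithmetic on plain numbers; `cpu_utilization` is a hypothetical helper, and it ignores counter resets, which PromQL's `rate()` would handle.

```python
def cpu_utilization(
    idle_start: list[float],
    idle_end: list[float],
    window_seconds: float,
) -> float:
    """Utilization in [0, 1] from per-core idle counters at window start and end."""
    if not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need matching, non-empty per-core samples")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Idle seconds accumulated per second of wall time, for each core.
    idle_rates = [(end - start) / window_seconds
                  for start, end in zip(idle_start, idle_end)]
    return 1.0 - sum(idle_rates) / len(idle_rates)


if __name__ == "__main__":
    # Two cores over a 5-minute window: idle for 30 s and 60 s respectively.
    usage = cpu_utilization([0.0, 0.0], [30.0, 60.0], 300.0)
    print(f"utilization={usage:.2f} alert={usage > 0.8}")
```

In this example the idle rates are 0.1 and 0.2, so the average idle fraction is 0.15 and utilization is 0.85, which would cross an 80% threshold.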

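4) Unified Release and Change (Example: Staged Cross-Cloud Kubernetes Rollout)

# 1) Switch kubectl contexts across clusters
kubectl config get-contexts
kubectl config use-context k8s-aws

# 2) Roll out to the AWS cluster first (first stage of the gray release)
kubectl -n prod apply -f deploy-v2.yaml
kubectl -n prod rollout status deploy/order-service

# 3) Switch to the Alibaba Cloud cluster and apply the same change
kubectl config use-context k8s-aliyun
kubectl -n prod apply -f deploy-v2.yaml
kubectl -n prod rollout status deploy/order-service

Rollback command (acts on the current context only; repeat for each cluster that received the change)

kubectl -n prod rollout undo deploy/order-service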
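The cluster-by-cluster sequence can be automated so that a failed stage stops the rollout before it reaches the next cloud. This is a minimal sketch under stated assumptions: `staged_rollout` and the injectable `run` callable are hypothetical, the default runner simply shells out to `kubectl`, and only the failing cluster is rolled back.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def kubectl(args: Sequence[str]) -> int:
    """Default runner: invoke kubectl and return its exit code."""
    return subprocess.run(["kubectl", *args]).returncode


def staged_rollout(
    contexts: Sequence[str],
    manifest: str,
    deployment: str,
    namespace: str = "prod",
    run: Runner = kubectl,
) -> list[str]:
    """Apply the manifest one context at a time; stop at the first failure.

    Returns the contexts that were updated successfully. The failing
    context is rolled back with `rollout undo`; later contexts are untouched.
    """
    done: list[str] = []
    target = f"deploy/{deployment}"
    for ctx in contexts:
        base = ["--context", ctx, "-n", namespace]
        applied = run([*base, "apply", "-f", manifest]) == 0
        healthy = applied and run(
            [*base, "rollout", "status", target, "--timeout=300s"]
        ) == 0
        if not healthy:
            run([*base, "rollout", "undo", target])
            break
        done.append(ctx)
    return done
```

Passing `--context` on every command avoids mutating the shared kubeconfig with `use-context`, which matters when several operators or pipelines run against the same file.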

5) Permissions and Audit Integration (Example: Kubernetes RBAC and Auditing)

# /opt/k8s/rbac/team-a-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: prod
  name: team-a-role
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments"]
  verbs: ["get","list","watch","create","update","patch","delete"]

kubectl apply -f /opt/k8s/rbac/team-a-role.yaml
kubectl -n prod create rolebinding team-a-binding \
  --role=team-a-role --user=team-a@example.com
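To review what a Role grants before applying it, the rule matching can be approximated offline. The sketch below is a deliberate simplification of Kubernetes RBAC evaluation: it handles only `apiGroups`, `resources`, `verbs`, and the `*` wildcard, and ignores `resourceNames`, subresources, and non-resource URLs. `role_allows` is a hypothetical helper name.

```python
def _matches(allowed: list[str], value: str) -> bool:
    """A rule list matches if it contains the value or the '*' wildcard."""
    return "*" in allowed or value in allowed


def role_allows(rules: list[dict], api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule grants the verb on the resource in the API group."""
    return any(
        _matches(rule.get("apiGroups", []), api_group)
        and _matches(rule.get("resources", []), resource)
        and _matches(rule.get("verbs", []), verb)
        for rule in rules
    )


if __name__ == "__main__":
    # Rules copied from team-a-role above.
    team_a_rules = [{
        "apiGroups": ["", "apps"],
        "resources": ["pods", "deployments"],
        "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
    }]
    print(role_allows(team_a_rules, "apps", "deployments", "patch"))
    print(role_allows(team_a_rules, "", "secrets", "get"))
```

On a live cluster, `kubectl auth can-i` with `--as` remains the authoritative check; the offline version is only useful for linting manifests in review.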

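6) Cost Governance and Resource Optimization

Example: cost tag check (shell)

#!/bin/bash
# /opt/cost/check_tags.sh
set -euo pipefail
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].{ID:InstanceId,Tags:Tags}' --output json \
| jq -r '.[] | select((.Tags // []) | map(.Key) | index("cost_center") | not) | .ID'

# Expected: prints the IDs of instances that lack a cost_center tag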
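The same check can be run in Python when the result needs to feed a report or the CMDB. This sketch assumes the input is the full, unprojected JSON from `aws ec2 describe-instances --output json`; `instances_missing_tag` is a hypothetical helper name.

```python
import json


def instances_missing_tag(describe_output: dict, tag_key: str = "cost_center") -> list[str]:
    """Return the IDs of instances that do not carry the given tag key."""
    missing = []
    for reservation in describe_output.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Untagged instances omit the Tags field entirely.
            keys = {tag["Key"] for tag in instance.get("Tags") or []}
            if tag_key not in keys:
                missing.append(instance["InstanceId"])
    return missing


if __name__ == "__main__":
    sample = json.loads("""
    {"Reservations": [{"Instances": [
        {"InstanceId": "i-aaa", "Tags": [{"Key": "cost_center", "Value": "CC1001"}]},
        {"InstanceId": "i-bbb", "Tags": [{"Key": "Name", "Value": "web"}]},
        {"InstanceId": "i-ccc"}
    ]}]}
    """)
    print(instances_missing_tag(sample))
```

Treating a missing `Tags` field as an empty list is the important detail: instances with no tags at all are exactly the ones a cost audit most needs to surface.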

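Common Troubleshooting Checklist and Commands

Resources not synchronized: check API permissions and network

# Verify cloud API connectivity and permissions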
aws sts get-caller-identity
aliyun ecs DescribeRegions | jq '.Regions.Region[] | .RegionId'
openstack token issue

Prometheus federation returns no data: check the /federate endpoint and the match[] selectors

curl -sG "http://prometheus-aws.example.com:9090/federate" --data-urlencode 'match[]=up' | head
grep "federate" /opt/prometheus/prometheus.yml

Cross-cloud Kubernetes operations fail: check the context and kubeconfig

kubectl config current-context
kubectl cluster-info

Alerts do not fire: check that the rules are loaded and that the expression is valid

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name:.name, state:.state}'
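Prometheus reports loaded rules through its HTTP API at `/api/v1/rules`, with groups nested under `data.groups`. The sketch below flattens such a response into one row per rule so unhealthy or missing rules stand out; `summarize_rules` and `unhealthy_rules` are hypothetical helper names, and the payload is assumed to be already fetched and parsed.

```python
def summarize_rules(payload: dict) -> list[dict]:
    """Flatten an /api/v1/rules response into one row per rule."""
    rows = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            rows.append({
                "group": group.get("name"),
                "name": rule.get("name"),
                "type": rule.get("type"),
                "state": rule.get("state"),  # absent for recording rules
                "health": rule.get("health"),
            })
    return rows


def unhealthy_rules(rows: list[dict]) -> list[str]:
    """Names of rules whose last evaluation was not reported as ok."""
    return [row["name"] for row in rows if row["health"] != "ok"]


if __name__ == "__main__":
    sample = {"status": "success", "data": {"groups": [{
        "name": "multi-cloud",
        "rules": [{"name": "HighCPUUsage", "type": "alerting",
                   "state": "inactive", "health": "ok"}],
    }]}}
    for row in summarize_rules(sample):
        print(row)
```

An empty result usually means no rule files were loaded at all, which points at the server configuration rather than at the expression.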

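Exercises and Checkpoints

Export resources with the three inventory commands, convert them to the unified fields, and write them to the CMDB (the API may be mocked).
Deploy the Prometheus aggregation server and verify that the up metric from all three clouds can be queried.
Complete a cross-cloud Kubernetes deployment and rollback drill, and capture the rollout status output.
Write a script that checks for resources missing the cost_center tag and outputs the list.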

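Reusable Implementation Templates

Unified resource model and tagging specification
Multi-cloud monitoring metric dictionary and alert policy library
Unified change process and automated rollback templates
Cost allocation rules and optimization recommendation checklist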