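19.11.9 Multi-Cloud and Hybrid-Cloud Operations Platform: A Practical Case Study
In multi-cloud and hybrid-cloud environments, the operations platform must manage private clouds, public clouds, and on-premises IDCs in a unified way. The main problems to solve are resource fragmentation, network interconnection, consistent identity and permissions, unified monitoring and alerting, and cost and compliance governance. This case study follows a platform approach of "unified control plane + unified data plane + unified process" and results in a multi-cloud operations system that can be replicated.
Principles and Architecture Sketch
Core Goals and Scope
- Unified resource view: bring public-cloud and private-cloud compute, network, storage, and Kubernetes cluster resources under management
- Unified monitoring and alerting: build cross-cloud monitoring with Prometheus federation or remote write
- Unified configuration and release: standardized configuration baselines, plus image and artifact management
- Unified permissions and auditing: integrate with a single identity source and permission policies
- Cost and SLA governance: multi-cloud cost visibility, SLA/SLO measurement, and optimization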
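Platform Architecture and Key Components
- Resource onboarding layer: adapters for cloud-vendor and private-cloud APIs, resource synchronization, and change subscriptions
- Asset and topology layer: a unified CMDB model and business-application-resource topology mapping
- Operations tooling layer: unified task orchestration, automated execution, and canary releases
- Monitoring and logging layer: multi-cluster Prometheus, unified alerting, and log search
- Security and audit layer: SSO, RBAC, operation auditing, and compliance checks
- Operations and cost layer: cost allocation, budget alerts, and resource optimization recommendations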
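The resource onboarding layer can be pictured as one adapter per cloud behind a shared interface. The sketch below is illustrative only: `Resource`, `CloudAdapter`, `StaticAdapter`, and `sync_all` are hypothetical names rather than part of any real SDK, and the in-memory adapter stands in for real cloud API calls.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, Tuple


@dataclass(frozen=True)
class Resource:
    resource_id: str
    cloud_type: str
    zone: str
    status: str


class CloudAdapter(Protocol):
    """Interface every per-cloud adapter is assumed to implement."""

    cloud_type: str

    def list_resources(self) -> Iterable[Resource]: ...


class StaticAdapter:
    """In-memory stand-in for a real cloud API client (assumption for this sketch)."""

    def __init__(self, cloud_type: str, rows: List[Tuple[str, str, str]]):
        self.cloud_type = cloud_type
        self._rows = rows

    def list_resources(self) -> Iterable[Resource]:
        for rid, zone, status in self._rows:
            yield Resource(rid, self.cloud_type, zone, status)


def sync_all(adapters: Iterable[CloudAdapter]) -> Dict[str, Resource]:
    """Merge every adapter's resources into one view keyed by 'cloud:id'."""
    view: Dict[str, Resource] = {}
    for adapter in adapters:
        for res in adapter.list_resources():
            view[f"{res.cloud_type}:{res.resource_id}"] = res
    return view
```

Keying the merged view by cloud type plus ID avoids collisions when two clouds happen to reuse the same identifier.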
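Typical Implementation Steps and Examples
1) Resource Inventory and Network Connectivity
Example: inventory resources on AWS, Alibaba Cloud, and a private OpenStack cloud, then produce unified output
# 1) AWS resource inventory (requires awscli to be installed and configured)
aws --version
aws configure set region ap-southeast-1
aws ec2 describe-instances --query 'Reservations[].Instances[].{ID:InstanceId,AZ:Placement.AvailabilityZone,Type:InstanceType,State:State.Name}' --output table
# 2) Alibaba Cloud ECS resource inventory (requires aliyun-cli with an AccessKey configured)
aliyun --version
aliyun ecs DescribeInstances --RegionId cn-hangzhou --PageSize 10 | jq '.Instances.Instance[] | {id:.InstanceId, zone:.ZoneId, type:.InstanceType, status:.Status}'
# 3) OpenStack inventory (requires sourcing the OpenStack credentials)
source /etc/kolla/admin-openrc.sh
openstack server list -f json | jq '.[] | {id:.ID, name:.Name, status:.Status}'
Expected result: each cloud returns the instance ID and status, plus zone and type where the command exposes them, so the three outputs can be mapped onto one schema before loading into the CMDB. Use --output json instead of table when the AWS result is saved to a file for import.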
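Mapping the three outputs onto one schema could look like the sketch below. It assumes the row shapes produced by the inventory commands in this step (`ID/AZ/Type/State` for AWS, `id/zone/type/status` for Alibaba Cloud, `id/name/status` for OpenStack); the `normalize` function and the output field names are hypothetical.

```python
def normalize(cloud_type, row):
    """Map one inventory row to a unified record; fields a cloud does not expose become None."""
    if cloud_type == "aws":
        rid, zone, itype, status = row["ID"], row["AZ"], row["Type"], row["State"]
    elif cloud_type == "aliyun":
        rid, zone, itype, status = row["id"], row["zone"], row["type"], row["status"]
    elif cloud_type == "openstack":
        # The OpenStack command in this step only returns id, name, and status.
        rid, zone, itype, status = row["id"], None, None, row["status"]
    else:
        raise ValueError(f"unsupported cloud_type: {cloud_type}")
    return {
        "resource_id": rid,
        "cloud_type": cloud_type,
        "zone": zone,
        "instance_type": itype,
        # Status casing differs per cloud (running / Running / ACTIVE), so lower-case it.
        "status": status.lower(),
    }
```

Lower-casing only aligns the casing; mapping vendor-specific states (for example OpenStack `ACTIVE` versus AWS `running`) to one vocabulary would still need an explicit lookup table.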
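Network connectivity check example
# Probe the private network of the cloud VPC from the on-premises IDC
ping -c 3 10.10.10.10
# Test TCP connectivity (for example the K8s API or a database port)
nc -vz 10.10.10.10 6443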
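When the same TCP check has to run from an automation job rather than a shell, a small standard-library probe is usually enough. This is a minimal sketch; `tcp_reachable` is a hypothetical helper, and it only tests whether a TCP handshake completes, not whether the service behind the port is healthy.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, `tcp_reachable("10.10.10.10", 6443)` corresponds to the `nc -vz` check above.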
2) Unified Asset Model and Tagging Policy
Example: unified CMDB fields (YAML); values separated by | list the allowed options
# /opt/cmdb/models/resource.yaml
resource_id: "i-xxxxxxxx"
cloud_type: "aws|aliyun|openstack|idc"
region: "ap-southeast-1"
zone: "ap-southeast-1a"
env: "prod|staging|dev"
app: "order-service"
owner: "team-a"
cost_center: "CC1001"
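Before records are written to the CMDB, they can be checked against this model. The sketch below assumes that all eight fields are mandatory and that `cloud_type` and `env` are restricted to the options listed in the YAML; `validate_resource` is a hypothetical helper, not part of any CMDB product.

```python
REQUIRED_FIELDS = (
    "resource_id", "cloud_type", "region", "zone",
    "env", "app", "owner", "cost_center",
)
# Allowed values taken from the "a|b|c" options in the model above.
ALLOWED_VALUES = {
    "cloud_type": {"aws", "aliyun", "openstack", "idc"},
    "env": {"prod", "staging", "dev"},
}


def validate_resource(record):
    """Return a list of problems; an empty list means the record fits the model."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing field: {field}")
    for field, allowed in ALLOWED_VALUES.items():
        value = record.get(field)
        if value and value not in allowed:
            errors.append(f"invalid {field}: {value}")
    return errors
```

Returning a list instead of raising on the first problem lets an import job report every defect of a record in one pass.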
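Resource import script example (Python sketch)
# /opt/cmdb/scripts/import_resources.py
import json

import requests


def upsert(resource):
    """POST one resource record to the CMDB and print the response."""
    r = requests.post("https://cmdb.example.com/api/v1/resources", json=resource, timeout=10)
    print(r.status_code, r.text)


# Input: JSON export of the AWS inventory query (a list of objects with an "ID" key)
with open("/opt/cmdb/data/aws_instances.json") as f:
    data = json.load(f)

for ins in data:
    # Business fields are hard-coded for illustration; in practice derive them from tags
    resource = {
        "resource_id": ins["ID"],
        "cloud_type": "aws",
        "region": "ap-southeast-1",
        "env": "prod",
        "app": "order-service",
        "owner": "team-a",
        "cost_center": "CC1001"
    }
    upsert(resource)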
3) Unified Monitoring Metrics (Prometheus Federation / Remote Write)
Installation and configuration example (Prometheus federation)
# 1) Install Prometheus (Linux example)
useradd -M -s /sbin/nologin prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar -xf prometheus-2.48.0.linux-amd64.tar.gz -C /opt/
ln -s /opt/prometheus-2.48.0.linux-amd64 /opt/prometheus
# 2) Configure federation aggregation
cat >/opt/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 30s
# Load alert rules on the aggregation instance (path matches the rule file used in this section)
rule_files:
  - /opt/prometheus/rules/*.yml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"k8s|node|mysql"}'
    static_configs:
      - targets:
          - 'prometheus-aws.example.com:9090'
          - 'prometheus-aliyun.example.com:9090'
          - 'prometheus-private.example.com:9090'
EOF
# 3) Start
/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.listen-address=0.0.0.0:9090
Expected result: queries against the aggregation instance return a unified set of metrics across clouds.
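The `/federate` endpoint receives its selectors as repeated `match[]` query parameters, and the brackets, braces, and quotes have to be URL-encoded when the request is built by hand. A minimal sketch using only the standard library follows; `federate_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def federate_url(base, matchers):
    """Build a /federate URL with one URL-encoded match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base.rstrip('/')}/federate?{query}"


url = federate_url("http://prometheus-aws.example.com:9090", ["up", '{job="node"}'])
print(url)
```

Fetching such a URL from each per-cloud Prometheus is a quick way to confirm that the selectors match any series before blaming the aggregation side.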
Unified alert rule example
# /opt/prometheus/rules/multi-cloud.rules.yml
groups:
  - name: multi-cloud
    rules:
      - alert: HighCPUUsage
        # CPU utilization = 1 - average idle rate across all cores of an instance
        expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High instance CPU usage"
          description: "CPU usage on instance {{ $labels.instance }} is above 80%"
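A common way to derive CPU utilization from `node_cpu_seconds_total` is one minus the average per-core idle rate. The arithmetic can be sketched as follows; this is a simplification that ignores counter resets and the extrapolation Prometheus applies in `rate()`, and `cpu_usage` with its sample values is purely illustrative.

```python
def cpu_usage(idle_start, idle_end, window_seconds):
    """Return utilization in [0, 1] from per-core idle counters (in seconds)."""
    # Per-core idle rate: idle seconds gained per second of wall-clock time.
    rates = [
        (idle_end[cpu] - idle_start[cpu]) / window_seconds
        for cpu in idle_start
    ]
    # Utilization is the complement of the average idle rate across cores.
    return 1 - sum(rates) / len(rates)


# Two cores over a 5-minute window: core 0 idled 30 s, core 1 idled 60 s.
usage = cpu_usage({"0": 1000.0, "1": 2000.0}, {"0": 1030.0, "1": 2060.0}, 300)
should_alert = usage > 0.8  # idle rates 0.1 and 0.2 give a usage of about 0.85
```

Averaging the rates of all non-idle modes instead would divide by the number of modes and cores, which understates the load and makes a 0.8 threshold practically unreachable.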
4) Unified Release and Change (Example: Cross-Cloud K8s Canary Release)
# 1) Switch between kubectl contexts for multiple clusters
kubectl config get-contexts
kubectl config use-context k8s-aws
# 2) Canary release (AWS cluster)
kubectl -n prod apply -f deploy-v2.yaml
kubectl -n prod rollout status deploy/order-service
# 3) Switch to the Alibaba Cloud cluster and apply the same change
kubectl config use-context k8s-aliyun
kubectl -n prod apply -f deploy-v2.yaml
kubectl -n prod rollout status deploy/order-service
Rollback command
kubectl -n prod rollout undo deploy/order-service
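The cluster-by-cluster sequence and its rollback can be wrapped in a small driver. The sketch below is one possible arrangement, not a prescribed workflow: `rollout` and `run_kubectl` are hypothetical helpers, the command runner is injectable so the logic can be exercised without a cluster, and the 300-second timeout is an assumed value.

```python
import subprocess


def run_kubectl(args):
    """Run kubectl with the given arguments; True means exit code 0."""
    return subprocess.run(["kubectl", *args]).returncode == 0


def rollout(contexts, manifest, deployment, namespace="prod", run=run_kubectl):
    """Apply a manifest cluster by cluster; undo every applied cluster on first failure."""
    applied = []
    for ctx in contexts:
        base = ["--context", ctx, "-n", namespace]
        if run(base + ["apply", "-f", manifest]):
            applied.append(ctx)
            status = ["rollout", "status", f"deploy/{deployment}", "--timeout=300s"]
            if run(base + status):
                continue
        # Roll back in reverse order, touching only clusters where apply succeeded.
        for done in reversed(applied):
            run(["--context", done, "-n", namespace,
                 "rollout", "undo", f"deploy/{deployment}"])
        return False
    return True
```

A call such as `rollout(["k8s-aws", "k8s-aliyun"], "deploy-v2.yaml", "order-service")` mirrors the manual steps. Note that `rollout undo` reverts to the previous revision, so the sketch assumes the applied manifest actually created a new revision.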
5) Permission and Audit Integration (Example: K8s RBAC and Auditing)
# /opt/k8s/rbac/team-a-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: prod
  name: team-a-role
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments"]
    verbs: ["get","list","watch","create","update","patch","delete"]

# Apply the Role and bind it to the team-a user
kubectl apply -f /opt/k8s/rbac/team-a-role.yaml
kubectl -n prod create rolebinding team-a-binding \
  --role=team-a-role --user=team-a@example.com
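6) Cost Governance and Resource Optimization
Example: cost tag check (Shell)
#!/bin/bash
# /opt/cost/check_tags.sh
set -euo pipefail
# Keep the instance ID next to its tags; untagged instances return Tags as null
aws ec2 describe-instances --query 'Reservations[].Instances[].{ID:InstanceId,Tags:Tags}' --output json \
  | jq -r '.[] | select((.Tags // []) | any(.Key == "cost_center") | not) | .ID'
# Expected: prints the IDs of instances that lack the cost_center tag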
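The same check can be run in Python against a saved `aws ec2 describe-instances --output json` document, which is convenient when the result feeds a report. This is a sketch under that assumption; `instances_missing_tag` is a hypothetical helper, and the sample document is made up for illustration.

```python
def instances_missing_tag(describe_output, tag_key="cost_center"):
    """Return IDs of instances in a describe-instances document that lack tag_key."""
    missing = []
    for reservation in describe_output.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Untagged instances have no "Tags" key at all.
            keys = {tag["Key"] for tag in instance.get("Tags") or []}
            if tag_key not in keys:
                missing.append(instance["InstanceId"])
    return missing


sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-aaa", "Tags": [{"Key": "cost_center", "Value": "CC1001"}]},
            {"InstanceId": "i-bbb", "Tags": [{"Key": "env", "Value": "prod"}]},
            {"InstanceId": "i-ccc"},
        ]}
    ]
}
print(instances_missing_tag(sample))
```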
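Common Troubleshooting Checklist and Commands
- Resources not synchronized: check API permissions and network connectivity
# Verify cloud API connectivity and permissions
aws sts get-caller-identity
aliyun ecs DescribeRegions | jq '.Regions.Region[] | .RegionId'
openstack token issue
- Prometheus federation returns no data: check the /federate endpoint and the metric matchers
# -G with --data-urlencode avoids curl treating [] as a URL glob
curl -sG --data-urlencode 'match[]=up' "http://prometheus-aws.example.com:9090/federate" | head
grep "federate" /opt/prometheus/prometheus.yml
- Cross-cloud K8s operations fail: check the context and kubeconfig
kubectl config current-context
kubectl cluster-info
- Alerts do not fire: check that the rules are loaded and the expression is valid
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name:.name, state:.state}'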
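Exercises and Checkpoints
- Export resources with the three cloud inventory commands, convert them to the unified fields, and write them to the CMDB (the API may be mocked).
- Deploy the Prometheus aggregation instance and verify that the up metric can be queried for all three clouds.
- Complete a cross-cloud K8s deployment and rollback drill, and output the rollout status.
- Write a script that checks for resources missing the cost_center tag and outputs a list.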
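Reusable Implementation Templates
- Unified resource model and tagging standard
- Multi-cloud monitoring metric dictionary and alert policy library
- Unified change process and automated rollback templates
- Cost attribution rules and optimization recommendation checklist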