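16.6.8 Hands-On Troubleshooting for Scheduling and Resource Management

This section covers practical troubleshooting for scheduling and resource management, built around a fast loop of symptom, diagnosis, verification, fix, and review. It includes a conceptual sketch, dependency installation (Metrics Server and Cluster Autoscaler examples), complete troubleshooting walkthroughs, and exercises.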

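I. Common Failure Symptoms and Diagnosis

1. Pod stuck in Pending

Typical causes: insufficient node resources, affinity or anti-affinity conflicts, untolerated taints, exhausted quota, or unschedulable nodes.
Diagnostic commands and notes:

kubectl get pod -n <ns>
# Check whether STATUS is Pending

kubectl describe pod <pod> -n <ns>
# Look for the FailedScheduling reason under Events

kubectl get node
# Check whether any node shows SchedulingDisabled

kubectl describe node <node>
# Inspect Taints, Conditions, and Allocatable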
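When more than one Pod is affected, it can help to look across all namespaces before drilling into a single Pod. The following is a minimal sketch; the function names are hypothetical helpers, and it assumes kubectl is already configured for the target cluster.

```shell
# List every Pod that is currently Pending, in all namespaces.
list_pending_pods() {
  kubectl get pods -A --field-selector=status.phase=Pending
}

# List FailedScheduling events in all namespaces, oldest first.
list_failed_scheduling() {
  kubectl get events -A --field-selector=reason=FailedScheduling \
    --sort-by=.metadata.creationTimestamp
}
```

Calling list_failed_scheduling first usually shows whether many Pods share the same failure message, which points to a cluster-level cause rather than a single misconfigured Pod.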

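2. Pod frequently evicted (Evicted)

Typical causes: memory or disk pressure on the node, unrealistic resource requests, or pulled images filling the disk.
Diagnostic commands and notes:

kubectl describe pod <pod> -n <ns>
# Reason: Evicted, plus the pressure type in the message

kubectl describe node <node>
# Check whether Conditions report MemoryPressure/DiskPressure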
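To see how widespread evictions are, the evicted Pods can be listed across all namespaces. This is a rough sketch with hypothetical helper names; it relies on the Pod's status.reason field, which prints as <none> for Pods that were not evicted.

```shell
# Keep only rows whose third column (REASON) is "Evicted".
# Reads "namespace name reason" rows on stdin, prints "namespace/name".
filter_evicted() {
  awk '$3 == "Evicted" { print $1 "/" $2 }'
}

# Print every evicted Pod in the cluster as namespace/name.
list_evicted_pods() {
  kubectl get pods -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason \
    | filter_evicted
}
```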

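3. Uneven scheduling or hotspot nodes

Typical causes: overly strict affinity rules, oversized resource requests, or inconsistent node labels.
Diagnostic commands and notes:

kubectl get pods -o wide -A
# See which nodes the Pods landed on

kubectl get node --show-labels
# Verify that labels are consistent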
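Reading the wide Pod listing by eye gets tedious on larger clusters. A small sketch for counting Pods per node follows; the function names are hypothetical, and Pods that have not been scheduled yet show up under <none>.

```shell
# Count identical lines on stdin and print the most frequent first.
count_by_node() {
  sort | uniq -c | sort -rn
}

# Print the number of Pods on each node, busiest node first.
pods_per_node() {
  kubectl get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName \
    | count_by_node
}
```

A Pod count is only a rough proxy for load; a node with a few large Pods can still be the real hotspot, so compare the result with the requests shown by kubectl describe node.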

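4. HPA/VPA not taking effect

Typical causes: Metrics Server failure, undefined resource requests, or metric collection lag.
Diagnostic commands and notes:

kubectl get hpa -n <ns>
kubectl describe hpa <name> -n <ns>
# Focus on Conditions and the current metrics

kubectl top pod -n <ns>
# No output usually points to a Metrics Server problem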
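When kubectl top returns nothing, it may be worth checking the metrics API directly rather than going through kubectl top. A minimal sketch, assuming Metrics Server registers the usual v1beta1.metrics.k8s.io APIService; the function name is a hypothetical helper.

```shell
# Check that the resource metrics API is registered and answering.
check_metrics_api() {
  # The APIService should report Available=True when Metrics Server is healthy.
  kubectl get apiservice v1beta1.metrics.k8s.io
  # Query the metrics API directly, bypassing kubectl top.
  kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
}
```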

5. Creation failures caused by quotas or limits

Typical causes: restrictions imposed by the namespace's ResourceQuota and LimitRange.
Diagnostic commands and notes:

kubectl get resourcequota -n <ns>
kubectl describe resourcequota <name> -n <ns>

kubectl get limitrange -n <ns>
kubectl describe limitrange <name> -n <ns>
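Quota failures usually come down to simple arithmetic on the requests. The sketch below is a simplified model with hypothetical helper names: it only sums the CPU requests passed in, whereas a real ResourceQuota counts the requests of all non-terminal Pods already in the namespace.

```shell
# Convert a CPU quantity ("600m", "1", "0.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

# Succeeds if the summed requests fit within the quota.
# Usage: fits_cpu_quota <quota> <request>...
fits_cpu_quota() {
  quota=$(cpu_to_millicores "$1"); shift
  total=0
  for r in "$@"; do
    total=$((total + $(cpu_to_millicores "$r")))
  done
  echo "requested=${total}m quota=${quota}m"
  [ "$total" -le "$quota" ]
}
```

For example, fits_cpu_quota 1 600m 600m reports 1200m against a 1000m quota and fails, which matches what happens when a second 600m Pod is rejected in a namespace limited to 1 CPU.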

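II. Troubleshooting Sketch and Workflow

1. Key path for scheduling and resource management (conceptual sketch)

2. Recommended troubleshooting workflow (closed loop)

Confirm why scheduling failed: run kubectl describe pod and read the FailedScheduling entries under Events.
Verify that nodes are schedulable: kubectl get node, kubectl describe node.
Check resource requests and limits: compare requests/limits against Allocatable.
Check affinity and taints: review nodeSelector, affinity, and tolerations.
Check quotas and limits: ResourceQuota and LimitRange.
Verify the metrics pipeline: Metrics Server and Prometheus Adapter.
Roll back and validate with a minimal change: temporarily lower requests or remove a constraint to confirm the cause.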
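The first five steps of this workflow can be sketched as a single helper that gathers the relevant output for one Pod. This is an illustrative sketch, not a complete tool: triage_pod is a hypothetical name, and the metrics check and the rollback step are left out because they depend on the workload.

```shell
# Collect scheduling-related information for one Pod.
# Usage: triage_pod <namespace> <pod>
triage_pod() {
  ns=$1; pod=$2
  echo "== 1. scheduling events =="
  kubectl get events -n "$ns" \
    --field-selector="involvedObject.name=$pod,reason=FailedScheduling"
  echo "== 2. node status =="
  kubectl get nodes
  echo "== 3. container requests and limits =="
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
  echo "== 4. nodeSelector, affinity, tolerations =="
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity}{"\n"}{.spec.tolerations}{"\n"}'
  echo "== 5. quota and limit ranges =="
  kubectl get resourcequota,limitrange -n "$ns"
}
```

Running triage_pod default over-request, for instance, prints the five sections in order so the output can be attached to an incident note.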

III. Installing and Checking Key Dependencies (Metrics Server and CA Examples)

Note: HPA and VPA depend on Metrics Server; automatic node scale-out depends on Cluster Autoscaler.

1. Install Metrics Server (example)

# Download the official manifest
curl -L -o /tmp/metrics-server.yaml \
  https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# If the kubelets serve self-signed certificates, add --kubelet-insecure-tls
# (not recommended for production)
sed -i '/- --kubelet-preferred-address-types/a\        - --kubelet-insecure-tls' \
  /tmp/metrics-server.yaml

kubectl apply -f /tmp/metrics-server.yaml

# Verify
kubectl get pods -n kube-system | grep metrics-server
kubectl top node

2. Install Cluster Autoscaler (example: configuration fragment)

# /etc/kubernetes/ca/cluster-autoscaler.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=clusterapi
        - --nodes=1:5:worker-group
        - --scale-down-enabled=true
        - --balance-similar-node-groups=true

kubectl apply -f /etc/kubernetes/ca/cluster-autoscaler.yaml
kubectl logs -n kube-system deploy/cluster-autoscaler | tail -n 50

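IV. Typical Cases and Complete Troubleshooting Walkthroughs

Case 1: Pod Pending (insufficient resources)

Symptom: Events report Insufficient cpu/memory
Walkthrough goal: reproduce -> verify -> fix

Reproduce (create a Pod with excessive resource requests)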

# /tmp/pod-over-request.yaml
apiVersion: v1
kind: Pod
metadata:
  name: over-request
  namespace: default
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "8"
        memory: "16Gi"
      limits:
        cpu: "8"
        memory: "16Gi"

kubectl apply -f /tmp/pod-over-request.yaml
kubectl get pod over-request

Diagnose

kubectl describe pod over-request
# Expected: Events show FailedScheduling / Insufficient cpu

Fix (lower the requests)

kubectl delete pod over-request
# Match the full field so that no other "8" in the file is rewritten
sed -i 's/cpu: "8"/cpu: "1"/' /tmp/pod-over-request.yaml
sed -i 's/16Gi/512Mi/' /tmp/pod-over-request.yaml
kubectl apply -f /tmp/pod-over-request.yaml
kubectl get pod over-request -w

Review
- Size resource requests to fit node capacity; consider enabling HPA/CA.
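A quick sanity check before applying a manifest is to compare the requested memory with a node's Allocatable value. The sketch below uses hypothetical helper names and handles only integer quantities with Ki, Mi, or Gi suffixes (or plain bytes). It also assumes an otherwise empty node, whereas the scheduler subtracts the requests of Pods already running there.

```shell
# Convert an integer memory quantity (Ki/Mi/Gi suffix, or plain bytes) to KiB.
mem_to_ki() {
  case "$1" in
    *Ki) echo "${1%Ki}" ;;
    *Mi) echo $(( ${1%Mi} * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 )) ;;
    *)   echo $(( $1 / 1024 )) ;;
  esac
}

# Succeeds if the request is no larger than the node's allocatable memory.
# Usage: fits_on_node <request> <allocatable>
fits_on_node() {
  [ "$(mem_to_ki "$1")" -le "$(mem_to_ki "$2")" ]
}
```

The allocatable value can be read with kubectl get node <node> -o jsonpath='{.status.allocatable.memory}'. With an illustrative value of 8052224Ki, fits_on_node 16Gi 8052224Ki fails while fits_on_node 512Mi 8052224Ki succeeds.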

Case 2: Taints preventing scheduling

Symptom: Events report node(s) had taint {key: value}

Reproduce

kubectl taint nodes <node> dedicated=infra:NoSchedule

Fix option 1: add a toleration

# /tmp/pod-toleration.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerate-infra
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "infra"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx:1.25

kubectl apply -f /tmp/pod-toleration.yaml
kubectl get pod tolerate-infra -o wide

Fix option 2: remove the taint (use with caution)

kubectl taint nodes <node> dedicated=infra:NoSchedule-

Case 3: Affinity conflict

Symptom: a hard requiredDuringScheduling rule matches no node

Reproduce (hard label requirement)

# /tmp/pod-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hard-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "disk"
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx:1.25

Fix: downgrade to a soft constraint

# /tmp/pod-affinity-soft.yaml
apiVersion: v1
kind: Pod
metadata:
  name: soft-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: "disk"
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx:1.25
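Before relaxing the rule, it may be worth confirming whether any node actually carries the disk=ssd label used in the manifests above. A minimal sketch with hypothetical function names:

```shell
# Nodes that satisfy the hard affinity rule (label key "disk", value "ssd").
nodes_matching_affinity() {
  kubectl get nodes -l disk=ssd
}

# Show the value of the "disk" label as an extra column for every node.
show_disk_label() {
  kubectl get nodes -L disk
}
```

If nodes_matching_affinity returns no nodes but some node really does have SSD storage, adding the label with kubectl label node <node> disk=ssd is an alternative to weakening the constraint.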

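Case 4: HPA does not scale out

Symptom: the HPA reports Unknown (for example, <unknown>/50% under TARGETS)
Approach: check metrics first, then requests

Reproduce the HPA (Metrics Server must be installed first)

kubectl create deploy web --image=nginx:1.25
kubectl set resources deploy/web --requests=cpu=100m,memory=128Mi
# Expose the Deployment so the load generator can reach http://web
kubectl expose deploy web --port=80
kubectl autoscale deploy web --cpu-percent=50 --min=1 --max=5

kubectl get hpa
kubectl describe hpa web

If it shows Unknown, check the metrics:

kubectl top pod
kubectl top node
# If there is no output, check Metrics Server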
kubectl get pods -n kube-system | grep metrics-server
kubectl logs -n kube-system deploy/metrics-server | tail -n 50

Generate load to verify scale-out

kubectl run -it --rm load --image=busybox \
  -- /bin/sh -c "while true; do wget -q -O- http://web; done"

kubectl get hpa -w
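To judge whether the replica count seen while watching is plausible, the HPA's core calculation can be reproduced by hand: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the configured minimum and maximum. The sketch below uses a hypothetical function name and deliberately ignores the controller's tolerance band and stabilization behavior, so small deviations from the target may scale here but not in a real cluster.

```shell
# Estimate the replica count an HPA would aim for.
# Usage: hpa_desired_replicas <current_replicas> <current_pct> <target_pct> <min> <max>
hpa_desired_replicas() {
  awk -v r="$1" -v cur="$2" -v tgt="$3" -v lo="$4" -v hi="$5" 'BEGIN {
    d = r * cur / tgt            # raw ratio-based estimate
    n = int(d); if (n < d) n++   # round up (ceil)
    if (n < lo) n = lo           # clamp to the minimum
    if (n > hi) n = hi           # clamp to the maximum
    print n
  }'
}
```

With the settings used in this case (target 50%, min 1, max 5), one replica running at 200% of its CPU request gives hpa_desired_replicas 1 200 50 1 5, which prints 4.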

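Case 5: Evicted (memory pressure)

Symptom: Pods are evicted and the node reports MemoryPressure

Diagnose

kubectl describe pod <pod> | grep -A3 -i evicted
kubectl describe node <node> | grep -i MemoryPressure -A2

Fix options
- Raise requests/limits and reduce the number of BestEffort Pods
- Free up disk space or memory on the node
- Enable VPA (example; assumes the VPA components are already installed, and avoid combining it with an HPA that scales the same Deployment on CPU or memory)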

# /tmp/vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"

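V. Quick Troubleshooting Checklist

[ ] Do Pod Events show FailedScheduling/Insufficient?
[ ] Is any Node NotReady or unschedulable?
[ ] Do requests exceed Allocatable?
[ ] Do affinity rules or taints conflict?
[ ] Is the Namespace quota exhausted?
[ ] Are metrics available (HPA/VPA)?
[ ] Is there an overly strict PDB or priority preemption?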

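VI. Exercises (Hands-On Verification)

Insufficient resources: create a Pod requesting 4 CPU / 8Gi, observe Pending, then reduce to 500m / 256Mi and confirm that it schedules.
Taints and tolerations: taint a node with dedicated=infra:NoSchedule, create a Pod without a toleration, record the failure event, then add the toleration and confirm that it schedules.
HPA verification: deploy Nginx, configure an HPA, generate load with BusyBox to trigger scale-out, and record the scale-out time and final replica count.
Quota limits: set a ResourceQuota of CPU=1 on a namespace, create two Pods requesting 600m each, observe the failure, and adjust.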

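VII. Optimization and Prevention

Standardize the requests/limits policy so that Pods with no requests do not cause resource contention.
Tier affinity constraints and prefer soft constraints.
Use ResourceQuota and LimitRange for resource governance.
Enable Cluster Autoscaler alongside HPA/VPA for better elasticity.
Set up scheduling-failure alerts and capacity forecasting to avoid running out of resources.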