16.6.3 Taints and Tolerations

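Taints and tolerations control whether a Pod may be scheduled onto a particular node; they answer the question of which Pods are allowed on which nodes. A taint is set on a node and makes that node reject, or evict, any Pod that lacks a matching toleration. A toleration is set on a Pod and declares that the Pod accepts certain taints. In short, the node declares a restriction and the Pod explicitly opts in.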

Conceptual Sketch
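The rule that a node declares a restriction and a Pod explicitly opts in can be sketched as a small matching function. This is an illustrative model, not the scheduler's own code: the names tolerates and untolerated_taints are hypothetical, and taints and tolerations are represented as plain dicts that mirror the YAML fields key, operator, value, and effect.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect on the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False

    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")

    # An empty key is only meaningful with Exists and matches every taint key.
    if key == "":
        return operator == "Exists"
    if key != taint["key"]:
        return False

    if operator == "Exists":
        return True  # key match is enough, the value is not compared
    # Equal: the values must be identical.
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(taints: list, tolerations: list) -> list:
    """Taints on the node that none of the Pod's tolerations match."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}
with_toleration = [
    {"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
]

print(untolerated_taints([gpu_taint], with_toleration))  # nothing left untolerated
print(untolerated_taints([gpu_taint], []))               # the gpu taint remains
```

A Pod is affected by a node's taints only through the ones that remain untolerated; what happens next depends on the effect of each remaining taint.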

Installation and Preparation (Tools and Permissions)

# 1) Install kubectl (Linux amd64 shown)
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
# Writing to /usr/local/bin normally requires root
sudo install -m 0755 kubectl /usr/local/bin/kubectl

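# 2) Verify access to the cluster
# (the --short flag was removed from recent kubectl releases; plain "kubectl version" prints the same summary)
kubectl version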
kubectl get nodes

# 3) Changing taints requires permission to patch Node objects (RBAC: patch on nodes)
kubectl auth can-i patch nodes

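Command notes:
- install -m 0755: copies the binary into a directory on the system PATH and marks it executable.
- kubectl auth can-i: checks whether the current identity is allowed to patch nodes, which is what changing taints requires.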

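Core Concepts and Effects

NoSchedule: Pods without a matching toleration are not scheduled onto the node; Pods already running there are left alone.
PreferNoSchedule: the scheduler tries to avoid the node, but this is a preference, not a guarantee.
NoExecute: Pods already on the node that do not tolerate the taint are evicted, and new Pods without a matching toleration are not scheduled there.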
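The three effects can be summarized as a small decision table. The sketch below is a simplified model with hypothetical function names; it takes the effects of the taints a Pod does not tolerate and reports what that means for a new Pod and for a Pod that is already running.

```python
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def scheduling_outcome(untolerated_effects: list) -> str:
    """Classify how untolerated taints affect placement of a new Pod."""
    effects = set(untolerated_effects)
    if effects & HARD_EFFECTS:
        return "blocked"        # the node is filtered out for this Pod
    if "PreferNoSchedule" in effects:
        return "deprioritized"  # the node stays eligible but is less preferred
    return "allowed"


def evicts_running_pod(untolerated_effects: list) -> bool:
    """Only an untolerated NoExecute taint removes a Pod that is already running."""
    return "NoExecute" in untolerated_effects


for effect in ("NoSchedule", "PreferNoSchedule", "NoExecute"):
    print(effect, scheduling_outcome([effect]), evicts_running_pod([effect]))
```

The table makes the asymmetry visible: NoSchedule and NoExecute both block new Pods, but only NoExecute also acts on Pods that were placed before the taint was added.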

Configuration Example: Taint a Node and Admit Specific Pods

1) Add a taint to the node

# Taint node-gpu to keep ordinary workloads off it
kubectl taint nodes node-gpu gpu=true:NoSchedule

# Inspect the node's taints
kubectl describe node node-gpu | grep -A3 Taints

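Command notes:
- kubectl taint nodes: adds or removes taints on a node.
- gpu=true:NoSchedule: the format is key=value:effect; a Pod must carry a matching toleration to be scheduled onto the node.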
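To make the argument format concrete, here is a minimal parser for taint specs of the form key=value:effect, with an optional trailing "-" for removal. It is a hypothetical helper written for illustration and is not how kubectl itself parses its arguments; it covers only the forms used in this section plus the value-less key:effect form.

```python
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def parse_taint_arg(arg: str) -> dict:
    """Parse 'key=value:effect' or 'key:effect', optionally followed by '-'."""
    remove = arg.endswith("-")          # a trailing '-' means "remove this taint"
    spec = arg[:-1] if remove else arg

    key_value, _, effect = spec.partition(":")
    key, _, value = key_value.partition("=")

    if not key:
        raise ValueError(f"missing key in taint spec: {arg!r}")
    if effect:
        if effect not in VALID_EFFECTS:
            raise ValueError(f"unknown effect {effect!r} in taint spec: {arg!r}")
    elif not remove:
        # Adding a taint always needs an effect; only removal may omit it.
        raise ValueError(f"an effect is required when adding a taint: {arg!r}")

    return {"key": key, "value": value, "effect": effect, "remove": remove}


print(parse_taint_arg("gpu=true:NoSchedule"))
print(parse_taint_arg("gpu=true:NoSchedule-"))
```

The separator between value and effect is a colon. A spec written with a second "=" in its place, such as gpu=true=NoSchedule, has no effect part, and this parser rejects it.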

2) Add a toleration to the Pod (YAML)

# File: /opt/k8s/pod-ai-train.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-train
  labels:
    app: ai
spec:
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: trainer
      image: busybox:1.36
      command: ["sh", "-c", "echo training on gpu-node; sleep 3600"]

kubectl apply -f /opt/k8s/pod-ai-train.yaml
kubectl get pod ai-train -o wide

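Expected result:
- ai-train is allowed onto node-gpu. The toleration only permits placement; without a node selector or node affinity the scheduler may still pick another node.
- A Pod without the toleration is never placed on node-gpu, and it stays Pending if no other schedulable node exists.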

3) NoExecute with graceful eviction

# File: /opt/k8s/pod-frontend.yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  tolerations:
    - key: "maintenance"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 120
  containers:
    - name: web
      image: nginx:1.25

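# Add a NoExecute taint to the node during maintenance
kubectl taint nodes node-1 maintenance=true:NoExecute

# Pods without a matching toleration are evicted immediately;
# frontend tolerates the taint and is evicted 120 seconds after it is added

Command notes:
- operator: Exists: the toleration matches on the key alone; no value is compared.
- tolerationSeconds: how long the Pod may stay on the node after the taint is added, which leaves time for a graceful shutdown.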
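The timing behavior of NoExecute can be modeled with a tiny function. This is a simplified sketch with a hypothetical name, covering a single taint and the one toleration that matches it (or None when nothing matches); it is not the eviction logic of the real control plane.

```python
from typing import Optional


def eviction_delay(matching_toleration: Optional[dict]) -> Optional[int]:
    """Seconds a running Pod stays after a NoExecute taint is added.

    Returns 0 for immediate eviction and None when the Pod is never evicted.
    """
    if matching_toleration is None:
        return 0  # no toleration: evicted right away

    seconds = matching_toleration.get("tolerationSeconds")
    if seconds is None:
        return None  # tolerated with no time limit: the Pod stays bound
    return max(0, seconds)  # zero or negative values mean immediate eviction


frontend = {
    "key": "maintenance",
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 120,
}
print(eviction_delay(frontend))  # 120
print(eviction_delay(None))      # 0
```

Omitting tolerationSeconds from a NoExecute toleration therefore keeps the Pod on the node for as long as the taint exists, which is rarely what a maintenance workflow wants.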

Combining with Affinity: Admission Plus Selection

# File: /opt/k8s/pod-gpu-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-train-2
spec:
  tolerations:
    - key: "gpu"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "gpu-node"
            operator: In
            values: ["true"]
  containers:
    - name: trainer
      image: busybox:1.36
      command: ["sh", "-c", "echo ai on gpu; sleep 3600"]

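Notes:
- The taint keeps unqualified Pods off the node, and node affinity steers qualified Pods to it. A toleration by itself only permits placement; it does not attract the Pod to the tainted node.
- The affinity rule assumes node-gpu carries the label gpu-node=true, which can be set with: kubectl label nodes node-gpu gpu-node=true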
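Why both mechanisms are needed becomes clearer with a toy feasibility check. The sketch below is a simplified model, not the scheduler: node-plain is a hypothetical untainted node added for contrast, the function names are made up, node affinity is reduced to exact label equality, and only the Equal and Exists operators with a non-empty key are handled.

```python
def tolerated(taint: dict, tolerations: list) -> bool:
    """True if any toleration covers the taint (Equal/Exists, non-empty key only)."""
    for tol in tolerations:
        if tol.get("effect") not in (None, "", taint["effect"]):
            continue
        if tol.get("key") != taint["key"]:
            continue
        if tol.get("operator", "Equal") == "Exists" or tol.get("value") == taint.get("value"):
            return True
    return False


def feasible_nodes(nodes: list, tolerations: list, required_labels: dict = None) -> list:
    """Names of nodes that pass both the taint filter and the label requirement."""
    required_labels = required_labels or {}
    result = []
    for node in nodes:
        hard = [t for t in node["taints"] if t["effect"] in ("NoSchedule", "NoExecute")]
        if not all(tolerated(t, tolerations) for t in hard):
            continue  # an untolerated hard taint keeps the Pod out
        if any(node["labels"].get(k) != v for k, v in required_labels.items()):
            continue  # required affinity label is missing
        result.append(node["name"])
    return result


nodes = [
    {"name": "node-gpu", "labels": {"gpu-node": "true"},
     "taints": [{"key": "gpu", "value": "true", "effect": "NoSchedule"}]},
    {"name": "node-plain", "labels": {}, "taints": []},  # hypothetical second node
]
gpu_toleration = [{"key": "gpu", "operator": "Equal", "value": "true", "effect": "NoSchedule"}]

print(feasible_nodes(nodes, []))                                      # ['node-plain']
print(feasible_nodes(nodes, gpu_toleration))                          # ['node-gpu', 'node-plain']
print(feasible_nodes(nodes, gpu_toleration, {"gpu-node": "true"}))    # ['node-gpu']
```

With the toleration alone, the Pod may still land on the untainted node; adding the label requirement narrows the candidates to the GPU node.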

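Troubleshooting and Operational Checks

# 1) Investigate a Pending Pod
kubectl describe pod <pod-name>
# In Events, look for a FailedScheduling message of the form "0/N nodes are available"
# that cites "node(s) had untolerated taint" (older releases: "node(s) had taint ..., that the pod didn't tolerate")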

# 2) Check which taints are set on each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# 3) Check whether system DaemonSets carry the required tolerations
kubectl -n kube-system get ds
kubectl -n kube-system describe ds <daemonset-name>

# 4) Temporarily remove the taint to confirm the diagnosis (the trailing "-" removes it)
kubectl taint nodes node-gpu gpu=true:NoSchedule-

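Troubleshooting tips:
- Check the Pod's events first, then the node's taints.
- NoExecute evicts Pods that are already running, so assess the impact on workloads before applying it.
- A DaemonSet without a matching toleration cannot place its Pods on tainted nodes, so system components may be missing from critical nodes.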
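When many nodes are involved, the taint check can be scripted against the JSON form of the node list. The sketch below assumes the structure produced by kubectl get nodes -o json (an items array whose entries have metadata.name and an optional spec.taints). The helper names are hypothetical, and SAMPLE is a hand-written stand-in rather than output captured from a real cluster.

```python
import json


def format_taint(taint: dict) -> str:
    """Render a taint as key=value:effect, or key:effect when it has no value."""
    value = taint.get("value")
    prefix = f'{taint["key"]}={value}' if value else taint["key"]
    return f'{prefix}:{taint["effect"]}'


def taints_by_node(node_list: dict) -> dict:
    """Map each node name to the list of its taints in key=value:effect form."""
    summary = {}
    for item in node_list.get("items", []):
        # spec.taints is absent on untainted nodes, so fall back to an empty list.
        taints = item.get("spec", {}).get("taints") or []
        summary[item["metadata"]["name"]] = [format_taint(t) for t in taints]
    return summary


# Hand-written sample in the shape of "kubectl get nodes -o json".
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-gpu"},
     "spec": {"taints": [{"key": "gpu", "value": "true", "effect": "NoSchedule"}]}},
    {"metadata": {"name": "node-1"},
     "spec": {"taints": [{"key": "maintenance", "value": "true", "effect": "NoExecute"}]}},
    {"metadata": {"name": "node-2"}, "spec": {}}
  ]
}
""")

for name, taints in taints_by_node(SAMPLE).items():
    print(name, taints or "<none>")
```

On a real cluster the same dict could be obtained by piping the kubectl output into the script and reading it with json.load(sys.stdin).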

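Exercises

1) Add the taint env=prod:NoSchedule to node-1, create a Pod with no tolerations, and inspect its Pending events (the Pod stays Pending only if no untainted node can take it).
2) Add tolerations to that Pod, redeploy it, and verify that it can now be scheduled onto node-1.
3) Add maintenance=true:NoExecute to node-2 while a Pod with tolerationSeconds: 30 is running there, and observe the eviction after 30 seconds.
4) Repeat with PreferNoSchedule and compare the scheduling behavior, recording how the share of Pods placed on that node changes.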