gpu-operator


Official documentation

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html

The GPU Operator bundles the following components:

  • Device plugin (nvidia-device-plugin)
  • Node Feature Discovery (NFD)
  • GPU Feature Discovery (GFD)
  • Driver management (nvidia-driver)
  • Container runtime management (nvidia-container-toolkit)
  • GPU partitioning / MIG management (nvidia-mig-manager)

and a number of other supporting components; you can verify what is running as shown below.
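
After installation (covered in the next section), a quick way to confirm these components are running is to list the operator pods; a minimal check, assuming the default gpu-operator namespace used in the install command below:

# Each component runs as its own DaemonSet/Deployment, typically including
# nvidia-device-plugin-daemonset, gpu-feature-discovery, nvidia-mig-manager
# and the node-feature-discovery worker pods
kubectl get pods -n gpu-operator -o wide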

Installing the GPU Operator with Helm

Official Helm installation guide:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html


helm repo add nvidia https://helm.ngc.nvidia.com/nvidia

helm repo update

helm install \
  --wait \
  --generate-name \
  -n gpu-operator \
  --create-namespace \
  nvidia/gpu-operator \
  --version=v25.3.1 \
  --set driver.enabled=false \
  --set mig.strategy=mixed

# --set driver.enabled=false \
# skips installing the GPU driver; use it when the driver is already installed on the host

# Uninstall
# helm delete -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}')
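
After the install completes, a quick sanity check (the release name is auto-generated by --generate-name, and the .status.state field name is based on current operator versions):

# Confirm the Helm release and its status
helm list -n gpu-operator

# The chart creates a ClusterPolicy named cluster-policy; it should report "ready"
# once all operands are up
kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.status.state}'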

Configuration file

custom-mig-config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      h200-149-v2:
        - devices: [4]
          mig-enabled: true
          mig-devices:
            "2g.35gb": 2
            "3g.71gb": 1
        - devices: [5]
          mig-enabled: true
          mig-devices:
            "1g.18gb": 2
            "2g.35gb": 1
            "3g.71gb": 1
        - devices: [0,1,2,3,6,7]
          mig-enabled: false
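
For reference, the operator also ships a built-in profile ConfigMap; dumping it is a convenient way to see which mig-devices combinations are considered valid before writing a custom one (the name default-mig-parted-config is the chart default and may differ in your deployment):

# Inspect the built-in MIG profiles shipped with the operator
kubectl get configmap -n gpu-operator default-mig-parted-config -o yaml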

Apply the configuration

kubectl apply -n gpu-operator -f custom-mig-config.yaml

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]'

kubectl label nodes h200-149 nvidia.com/mig.config=h200-149-v2 --overwrite
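
Applying the label triggers nvidia-mig-manager to reconfigure the GPUs on that node, which can take a couple of minutes and briefly restarts the GPU operands. A minimal way to watch the progress (the app=nvidia-mig-manager selector is an assumption based on the operator's standard component labels):

# Follow the mig-manager logs while it reconfigures the node
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -f

# Wait for the state label to change to "success"
kubectl get node h200-149 -L nvidia.com/mig.config,nvidia.com/mig.config.state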

Verify the configuration

kubectl get node h200-149 -o=jsonpath='{.metadata.labels}' | jq .
kubectl describe node h200-149 | grep nvidia.com/mig

The node is automatically labeled with labels like the following:

"nvidia.com/mig-1g.18gb.count": "2",
  "nvidia.com/mig-1g.18gb.engines.copy": "1",
  "nvidia.com/mig-1g.18gb.engines.decoder": "1",
  "nvidia.com/mig-1g.18gb.engines.encoder": "0",
  "nvidia.com/mig-1g.18gb.engines.jpeg": "1",
  "nvidia.com/mig-1g.18gb.engines.ofa": "0",
  "nvidia.com/mig-1g.18gb.memory": "16384",
  "nvidia.com/mig-1g.18gb.multiprocessors": "16",
  "nvidia.com/mig-1g.18gb.product": "NVIDIA-H200-MIG-1g.18gb",
  "nvidia.com/mig-1g.18gb.replicas": "1",
  "nvidia.com/mig-1g.18gb.sharing-strategy": "none",
  "nvidia.com/mig-1g.18gb.slices.ci": "1",
  "nvidia.com/mig-1g.18gb.slices.gi": "1",
  "nvidia.com/mig-2g.35gb.count": "3",
  "nvidia.com/mig-2g.35gb.engines.copy": "2",
  "nvidia.com/mig-2g.35gb.engines.decoder": "2",
  "nvidia.com/mig-2g.35gb.engines.encoder": "0",
  "nvidia.com/mig-2g.35gb.engines.jpeg": "2",
  "nvidia.com/mig-2g.35gb.engines.ofa": "0",
  "nvidia.com/mig-2g.35gb.memory": "33280",
  "nvidia.com/mig-2g.35gb.multiprocessors": "32",
  "nvidia.com/mig-2g.35gb.product": "NVIDIA-H200-MIG-2g.35gb",
  "nvidia.com/mig-2g.35gb.replicas": "1",
  "nvidia.com/mig-2g.35gb.sharing-strategy": "none",
  "nvidia.com/mig-2g.35gb.slices.ci": "2",
  "nvidia.com/mig-2g.35gb.slices.gi": "2",
  "nvidia.com/mig-3g.71gb.count": "2",
  "nvidia.com/mig-3g.71gb.engines.copy": "3",
  "nvidia.com/mig-3g.71gb.engines.decoder": "3",
  "nvidia.com/mig-3g.71gb.engines.encoder": "0",
  "nvidia.com/mig-3g.71gb.engines.jpeg": "3",
  "nvidia.com/mig-3g.71gb.engines.ofa": "0",
  "nvidia.com/mig-3g.71gb.memory": "71424",
  "nvidia.com/mig-3g.71gb.multiprocessors": "60",
  "nvidia.com/mig-3g.71gb.product": "NVIDIA-H200-MIG-3g.71gb",
  "nvidia.com/mig-3g.71gb.replicas": "1",
  "nvidia.com/mig-3g.71gb.sharing-strategy": "none",
  "nvidia.com/mig-3g.71gb.slices.ci": "3",
  "nvidia.com/mig-3g.71gb.slices.gi": "3",
  "nvidia.com/mig.capable": "true",
  "nvidia.com/mig.config": "h200-149-v2",
  "nvidia.com/mig.config.state": "success",
  "nvidia.com/mig.strategy": "mixed",
  "nvidia.com/mps.capable": "false",
  "nvidia.com/vgpu.present": "false"

When deploying an application, request the MIG device in the Deployment spec:

resources:
  requests:
    nvidia.com/mig-3g.71gb: 1
  limits:
    nvidia.com/mig-3g.71gb: 1
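
For context, a minimal standalone Pod that requests one 3g.71gb slice and simply lists the devices it sees; the CUDA image tag is only an example, substitute whatever base image you normally use:

apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # example image; any CUDA-enabled image works
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      # should print exactly one MIG device
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-3g.71gb: 1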

MIG partitioning strategy

# List the MIG profiles supported by GPU 0
nvidia-smi mig -i 0 -lgip

# List MIG profiles for all GPUs
nvidia-smi mig -lgip

H200 example output: mig-lgip.png

Partitioning example:

  • "1g.18gb": 2
  • "2g.35gb": 1
  • "3g.71gb": 1

Partitioning rules:

  • Quick check - the 1g / 2g / 3g before the dot is the number of compute slices; the slice counts only need to sum to at most 7 (see the check below).
  • Exact check - make sure the per-resource totals (Memory/SM/CE/DEC/JPEG/ENC/OFA) across all instances do not exceed those of the largest profile (7g.141gb).
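
Checking the example above against the first rule, plus a runtime check of what was actually created:

# 2 x 1g + 1 x 2g + 1 x 3g = 2 + 2 + 3 = 7 slices, which is <= 7, so the layout fits

# After the config is applied, list the GPU instances that were created
nvidia-smi mig -lgi

# MIG devices also appear as separate entries in the device list
nvidia-smi -L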

Common issues

Error:

Failed to pull image "registry.k8s.io/nfd/node-feature-discovery:v0.17.3": failed to pull and unpack image "registry.k8s.io/nfd/node-feature-discovery:v0.17.3": failed to resolve reference "registry.k8s.io/nfd/node-feature-discovery:v0.17.3": failed to do request: Head "https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/nfd/node-feature-discovery/manifests/v0.17.3": Service Unavailable

The image cannot be pulled from the upstream registry directly; relay it through the m.daocloud.io mirror.

Fix:

ctr -n k8s.io i pull m.daocloud.io/registry.k8s.io/nfd/node-feature-discovery:v0.17.3 && \
ctr -n k8s.io i tag m.daocloud.io/registry.k8s.io/nfd/node-feature-discovery:v0.17.3 registry.k8s.io/nfd/node-feature-discovery:v0.17.3
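
The same relay works for any other image that fails to pull: prefix the original reference with m.daocloud.io/. To confirm the retag took effect:

# The image should now resolve under its original name
ctr -n k8s.io i ls | grep node-feature-discovery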