AI Platform Engineering: Operations

GPU Node Setup

Prerequisites

Before Kubernetes can schedule GPU workloads, each GPU node requires:

NVIDIA Driver — kernel module for GPU hardware access
NVIDIA Container Toolkit — enables container runtimes to access GPUs
NVIDIA Device Plugin — advertises GPUs to kubelet
Optionally: NVIDIA GPU Operator — automates all of the above

Manual GPU Node Setup

# Verify GPU hardware is detected
lspci | grep -i nvidia

# Check NVIDIA driver installation
nvidia-smi

# Verify NVIDIA Container Toolkit
nvidia-container-cli info

# Deploy NVIDIA Device Plugin as DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

# Verify GPU resources are advertised
kubectl describe node <gpu-node> | grep nvidia.com/gpu

GPU Operator Deployment (Recommended)

The GPU Operator installs and manages the driver, container toolkit, device plugin, and monitoring components on every GPU node:

# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set nodeStatusExporter.enabled=true

# Verify all components are running
kubectl get pods -n gpu-operator

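# Check GPU resources on nodes (prints node name and allocatable GPU count)
kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable["nvidia.com/gpu"]) | "\(.metadata.name)\t\(.status.allocatable["nvidia.com/gpu"])"'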

Verify GPU Scheduling

# Run a test GPU pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Check pod logs
kubectl logs gpu-test

MIG Configuration

Supported GPU Profiles

MIG is available on NVIDIA Ampere+ GPUs (A100, A30, H100). Common profiles for A100 80GB:

Profile	GPU Memory	SMs	GPU Engines	Max Instances
`1g.10gb`	10 GB	14	1	7
`2g.20gb`	20 GB	28	2	3
`3g.40gb`	40 GB	42	3	2
`4g.40gb`	40 GB	56	4	1
`7g.80gb`	80 GB	98	7	1
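
Before creating instances, a proposed layout can be checked against the GPU's totals. The following is a minimal sketch, not an NVIDIA tool: `PROFILES` and `layout_fits` are hypothetical names, and the numbers are copied from the table above. The check is necessary but not sufficient, because real MIG placement also depends on slice positions that `nvidia-smi` enforces.

```python
# Compute slices and memory per MIG profile on an A100 80GB (from the table above).
PROFILES = {
    "1g.10gb": {"slices": 1, "memory_gb": 10},
    "2g.20gb": {"slices": 2, "memory_gb": 20},
    "3g.40gb": {"slices": 3, "memory_gb": 40},
    "4g.40gb": {"slices": 4, "memory_gb": 40},
    "7g.80gb": {"slices": 7, "memory_gb": 80},
}

TOTAL_SLICES = 7
TOTAL_MEMORY_GB = 80


def layout_fits(layout):
    """Return True if the requested instance counts stay within the GPU totals.

    `layout` maps a profile name to the number of instances wanted.
    """
    slices = sum(PROFILES[name]["slices"] * count for name, count in layout.items())
    memory = sum(PROFILES[name]["memory_gb"] * count for name, count in layout.items())
    return slices <= TOTAL_SLICES and memory <= TOTAL_MEMORY_GB


if __name__ == "__main__":
    print(layout_fits({"1g.10gb": 2, "2g.20gb": 1}))  # 4 slices, 40 GB
    print(layout_fits({"3g.40gb": 2, "2g.20gb": 1}))  # 8 slices, 100 GB
```

A layout that passes this check can still be rejected by the driver, so treat `nvidia-smi mig -lgip` as the authority.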

Enable and Configure MIG

# Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Reboot or reset GPU
sudo nvidia-smi -i 0 --gpu-reset

# List available MIG profiles
nvidia-smi mig -i 0 -lgip

# Create MIG instances (profile IDs come from the -lgip output; 19 and 14 apply to A100 80GB)
sudo nvidia-smi mig -i 0 -cgi 19,19,14 -C
# Creates: 2x 1g.10gb + 1x 2g.20gb

# Verify MIG instances
nvidia-smi mig -i 0 -lgi

# List compute instances
nvidia-smi mig -i 0 -lci

# Destroy all MIG instances (compute instances first, then GPU instances)
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi

MIG with Kubernetes

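When MIG is enabled with the mixed strategy (mig.strategy=mixed in the GPU Operator), the NVIDIA Device Plugin advertises each MIG profile as a separate resource; with the single strategy, MIG instances are advertised as nvidia.com/gpu: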

# Node capacity with MIG
nvidia.com/mig-1g.10gb: 2
nvidia.com/mig-2g.20gb: 1

# Pod requesting a specific MIG profile
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
    - name: inference
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1

GPU Time-Slicing Configuration

Enable Run:ai Time-Slicing

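# Using Helm values for the Run:ai cluster chart
# (release and chart names below assume a default Run:ai cluster installation)
cat <<EOF > values-timeslicing.yaml
clusterConfig:
  global:
    core:
      timeSlicing:
        mode: fair  # or "strict"
EOF

helm upgrade runai-cluster runai/runai-cluster \
  -f values-timeslicing.yaml \
  --reuse-values \
  --namespace runai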

# Using kubectl patch (runtime)
kubectl patch -n runai runaiconfigs.run.ai/runai \
  --type='merge' \
  --patch '{"spec":{"global":{"core":{"timeSlicing":{"mode": "fair"}}}}}'

NVIDIA Native Time-Slicing (without Run:ai)

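# ConfigMap for NVIDIA Device Plugin time-slicing
# It takes effect once the ClusterPolicy references it (spec.devicePlugin.config.name and .default)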
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Allow 4 pods per GPU
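
The effect of `replicas` on what a node advertises can be sketched in a few lines. This is an illustration of the arithmetic only, with a hypothetical helper name `advertised_resources`; it assumes the device plugin behavior where `renameByDefault: true` appends `.shared` to the resource name.

```python
def advertised_resources(physical_gpus, replicas, rename_by_default=False):
    """Return the extended resources a node would advertise under time-slicing.

    Each physical GPU is exposed `replicas` times. Replicas share the same
    GPU memory and compute, so this raises pod density, not capacity.
    """
    name = "nvidia.com/gpu.shared" if rename_by_default else "nvidia.com/gpu"
    return {name: physical_gpus * replicas}


if __name__ == "__main__":
    # A node with 2 physical GPUs and the ConfigMap above (replicas: 4).
    print(advertised_resources(2, 4))
```

Time-sliced replicas have no memory or fault isolation from each other, which is why the scheduler seeing 8 GPUs does not mean 8 workloads will fit in memory.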

vLLM Deployment

Standalone vLLM Server

# Install vLLM
pip install vllm

# Start vLLM server with a model
vllm serve meta-llama/Llama-2-7b-chat-hf \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --dtype auto

# Test with curl (OpenAI-compatible API)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
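
The same request can be issued from Python using only the standard library. This is a sketch under the assumption that the server from the previous step is reachable at the given base URL; `build_chat_request` and `chat` are hypothetical helper names, and the response is assumed to follow the OpenAI chat completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, max_tokens=128, timeout=60):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "meta-llama/Llama-2-7b-chat-hf", "Hello!")` should return the generated reply as a string.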

vLLM on Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-2-7b-chat-hf"
            - "--tensor-parallel-size"
            - "1"
            - "--gpu-memory-utilization"
            - "0.9"
            - "--max-model-len"
            - "4096"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
            requests:
              memory: 16Gi
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

vLLM Key Parameters

Parameter	Default	Purpose
`--tensor-parallel-size`	1	Number of GPUs for tensor parallelism
`--gpu-memory-utilization`	0.9	Fraction of GPU memory vLLM may use in total (weights, activations, and KV cache)
`--max-model-len`	Model default	Maximum sequence length
`--dtype`	auto	Weight precision (float16, bfloat16, auto)
`--enforce-eager`	false	Disable CUDA graphs (debug mode)
`--max-num-seqs`	256	Maximum concurrent sequences
`--quantization`	none	Quantization method (e.g. awq, gptq, fp8; supported values depend on the vLLM version)
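
To see how `--gpu-memory-utilization` and `--max-model-len` interact, it helps to estimate how much KV cache a token needs. The sketch below is back-of-the-envelope arithmetic, not vLLM's own accounting; the function names are hypothetical, and the weight and overhead figures in the example are assumptions chosen for illustration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_memory_gib, utilization, weights_gib, overhead_gib, bytes_per_token):
    """Tokens that fit in the budget left after weights and runtime overhead."""
    budget_gib = gpu_memory_gib * utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3 // bytes_per_token)


if __name__ == "__main__":
    # Llama-2-7B: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
    per_token = kv_cache_bytes_per_token(32, 32, 128)
    print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token
    # Assumed: 80 GiB GPU, about 13 GiB of FP16 weights, 2 GiB of overhead.
    print(max_cached_tokens(80, 0.9, 13, 2, per_token))
```

At 0.5 MiB per token, one 4096-token sequence needs about 2 GiB of cache, which is why lowering `--max-model-len` or `--max-num-seqs` is often the first remedy when the cache does not fit.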

Ray Cluster Deployment

Ray on Kubernetes (KubeRay)

# Install KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

helm install kuberay-operator kuberay/kuberay-operator \
  --namespace ray-system \
  --create-namespace

# Deploy a Ray cluster
cat <<EOF | kubectl apply -f -
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: gpu-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            # Base Ray GPU image; build a derived image if ML libraries such as PyTorch are needed
            image: rayproject/ray:2.46.0-py311-gpu
            ports:
              - containerPort: 6379
              - containerPort: 8265  # Dashboard
            resources:
              limits:
                cpu: "4"
                memory: "8Gi"
  workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 4
      groupName: gpu-workers
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.46.0-py311-gpu
              resources:
                limits:
                  cpu: "4"
                  memory: "16Gi"
                  nvidia.com/gpu: 1
EOF

# Check Ray cluster status
kubectl get rayclusters
kubectl get pods -l ray.io/cluster=gpu-cluster

Ray Job Submission

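# Point the CLI at the Ray dashboard; KubeRay names the head service <cluster-name>-head-svc
# (resolvable from inside the cluster; use a port-forward and localhost otherwise)
export RAY_ADDRESS=http://gpu-cluster-head-svc:8265

# Submit a Ray job
ray job submit \
  --working-dir . \
  -- python train.py

# Check job status
ray job status <job-id>

# View job logs
ray job logs <job-id>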

Basic Ray Task Example

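import ray

ray.init()

@ray.remote(num_gpus=1)
def train_on_chunk(data_chunk):
    import torch
    # Ray sets CUDA_VISIBLE_DEVICES, so "cuda" is the GPU reserved for this task
    device = torch.device("cuda")
    # Process data chunk on GPU
    batch = torch.tensor(data_chunk, dtype=torch.float32, device=device)
    return batch.sum().item()

# Placeholder input; replace with the real dataset
data = list(range(10_000))

# Distribute work across GPUs, one 1000-element chunk per task
chunks = [data[i:i+1000] for i in range(0, len(data), 1000)]
futures = [train_on_chunk.remote(chunk) for chunk in chunks]
results = ray.get(futures)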

Batch Scheduling with Volcano

Install Volcano

# Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# Verify installation
kubectl get pods -n volcano-system

Gang-Scheduled Training Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4  # Gang scheduling: all 4 workers required
  schedulerName: volcano
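  queue: default
  plugins:
    env: []  # Injects VC_TASK_INDEX into each pod
    svc: []  # Creates a headless service so workers can resolve each other
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - name: pytorch-worker
              image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
              command: ["torchrun"]
              args:
                - "--nproc_per_node=1"
                - "--nnodes=4"
                - "--node_rank=$(VC_TASK_INDEX)"
                # Rank 0 hostname, assuming the svc plugin's <job>-<task>-<index>.<job> naming
                - "--master_addr=distributed-training-worker-0.distributed-training"
                - "--master_port=29500"
                - "train.py"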
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure
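
The all-or-nothing behavior that `minAvailable` requests can be illustrated with a small placement function. This is a conceptual sketch, not Volcano's scheduling algorithm; `gang_schedule` and the node names are hypothetical.

```python
def gang_schedule(free_gpus_per_node, min_available, gpus_per_pod=1):
    """Place all pods or none.

    Returns one node name per pod if at least `min_available` pods fit,
    otherwise an empty list so that no GPU is held by a partial job.
    """
    placements = []
    for node, free in sorted(free_gpus_per_node.items()):
        slots = free // gpus_per_pod
        placements.extend([node] * slots)
    if len(placements) < min_available:
        return []
    return placements[:min_available]


if __name__ == "__main__":
    # Only 3 GPUs are free, so a 4-worker job gets nothing rather than 3 idle workers.
    print(gang_schedule({"node-a": 2, "node-b": 1}, min_available=4))
    print(gang_schedule({"node-a": 2, "node-b": 2}, min_available=4))
```

Without gang scheduling, the first case would start three workers that hold GPUs while waiting indefinitely for the fourth.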

Job Queueing with Kueue

Install Kueue

# Install Kueue
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.11.0/manifests.yaml

# Verify
kubectl get pods -n kueue-system

Configure Resource Quotas

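# ResourceFlavor referenced by the ClusterQueue below
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
---
# ClusterQueue with GPU quota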
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: a100
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8
            - name: "cpu"
              nominalQuota: 64
            - name: "memory"
              nominalQuota: 256Gi
---
# LocalQueue for team namespace (jobs select it with the kueue.x-k8s.io/queue-name label)
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-ml-queue
  namespace: ml-team
spec:
  clusterQueue: gpu-cluster-queue
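
The quota check behind admission can be sketched as follows. This is a deliberately simplified model that ignores cohorts, borrowing, priorities, and preemption; `admit`, `run_queue`, and the workload names are hypothetical, and the quota values mirror the ClusterQueue above.

```python
NOMINAL_QUOTA = {"nvidia.com/gpu": 8, "cpu": 64, "memory_gi": 256}


def admit(usage, request, quota=NOMINAL_QUOTA):
    """Admit only if every requested resource fits in the remaining quota."""
    return all(
        usage.get(resource, 0) + amount <= quota.get(resource, 0)
        for resource, amount in request.items()
    )


def run_queue(workloads, quota=NOMINAL_QUOTA):
    """Process workloads in order; return (admitted, pending) name lists."""
    usage, admitted, pending = {}, [], []
    for name, request in workloads:
        if admit(usage, request, quota):
            for resource, amount in request.items():
                usage[resource] = usage.get(resource, 0) + amount
            admitted.append(name)
        else:
            pending.append(name)
    return admitted, pending


if __name__ == "__main__":
    workloads = [
        ("train-a", {"nvidia.com/gpu": 6, "cpu": 24}),
        ("train-b", {"nvidia.com/gpu": 4, "cpu": 16}),
        ("notebook", {"nvidia.com/gpu": 2, "cpu": 8}),
    ]
    print(run_queue(workloads))
```

Here `train-b` stays pending because admitting it would take GPU usage to 10 of 8, while the smaller `notebook` workload still fits.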

GPU Monitoring

DCGM Exporter Metrics

The DCGM (Data Center GPU Manager) Exporter runs as part of the GPU Operator and exposes Prometheus metrics:

# Key GPU metrics
DCGM_FI_DEV_GPU_UTIL          # GPU utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL     # Memory bandwidth utilization (%)
DCGM_FI_DEV_FB_USED           # Framebuffer memory used (MiB)
DCGM_FI_DEV_FB_FREE           # Framebuffer memory free (MiB)
DCGM_FI_DEV_GPU_TEMP          # GPU temperature (°C)
DCGM_FI_DEV_POWER_USAGE       # Power consumption (W)
DCGM_FI_DEV_SM_CLOCK          # SM clock frequency (MHz)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL  # NVLink bandwidth
DCGM_FI_DEV_PCIE_REPLAY_COUNTER    # PCIe replay errors

Essential Monitoring Queries (PromQL)

# GPU utilization across cluster
avg(DCGM_FI_DEV_GPU_UTIL) by (gpu, Hostname)

# GPU memory usage percentage
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100

# Idle GPUs (utilization < 5% for 10 minutes)
max_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) < 5

# GPU temperature alerts
DCGM_FI_DEV_GPU_TEMP > 85

# Pods requesting GPUs (kube-state-metrics)
kube_pod_container_resource_limits{resource="nvidia_com_gpu"} > 0
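
The memory-percentage and idle-GPU conditions above reduce to simple arithmetic over samples. The sketch below mirrors them in Python for use in ad hoc scripts; the function names are hypothetical, and the idle check assumes the caller has already collected the utilization samples for the window of interest.

```python
def memory_used_percent(fb_used_mib, fb_free_mib):
    """Framebuffer usage as a percentage: used / (used + free) * 100."""
    total = fb_used_mib + fb_free_mib
    return 0.0 if total == 0 else fb_used_mib / total * 100


def idle_for_window(utilization_samples, threshold=5):
    """True only if every sample in the window is below the threshold.

    An empty window is treated as "not idle" to avoid flagging GPUs
    that simply have no data.
    """
    return bool(utilization_samples) and max(utilization_samples) < threshold


if __name__ == "__main__":
    print(memory_used_percent(20480, 61440))
    print(idle_for_window([0, 1, 3, 2]))
    print(idle_for_window([0, 40, 0, 0]))
```

Checking the maximum over the window, rather than the latest sample, keeps a GPU that is briefly between batches from being reported as idle.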

nvidia-smi Quick Reference

# Full GPU status
nvidia-smi

# Continuous monitoring (refresh every 1 second)
nvidia-smi -l 1

# Query specific metrics
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv

# Show running GPU processes
nvidia-smi pmon -s um -d 1

# Check MIG status
nvidia-smi mig -lgi -i 0

# Show NVLink status
nvidia-smi nvlink -s

# Show GPU topology
nvidia-smi topo -m
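
For scripting, the `--query-gpu` output is easier to consume with `--format=csv,noheader,nounits`. The following sketch parses that output into dictionaries; `parse_gpu_csv` and `query_gpus` are hypothetical helpers, and the parser assumes every numeric field is reported (a field shown as `[N/A]` would need extra handling).

```python
import csv
import io
import subprocess

FIELDS = ["name", "temperature.gpu", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        name, *numbers = [value.strip() for value in record]
        row = {"name": name}
        row.update({field: float(v) for field, v in zip(FIELDS[1:], numbers)})
        rows.append(row)
    return rows


def query_gpus():
    """Run nvidia-smi on the local node and return the parsed rows."""
    command = [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(command, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(output)
```

On a GPU node, `query_gpus()` returns a list such as one entry per device with temperature in °C, utilization in percent, and memory in MiB.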

Troubleshooting

GPU Not Visible to Kubernetes

# Check if driver is loaded
lsmod | grep nvidia

# Check device plugin pods
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Verify extended resources on node (nvidia.com/gpu is listed after cpu, storage, hugepages, and memory)
kubectl describe node <node> | grep -A10 "Allocatable"

Pod Stuck in Pending (GPU)

# Check if GPU resources are available
kubectl describe node <node> | grep nvidia.com/gpu

# Check events on the pending pod
kubectl describe pod <pod-name> | tail -20

# Common causes:
# - No GPU nodes available
# - All GPUs allocated to other pods
# - Resource request exceeds node capacity
# - Taints/tolerations preventing scheduling

GPU Out of Memory (OOM)

# Check GPU memory usage
nvidia-smi

# For vLLM: reduce memory utilization
# --gpu-memory-utilization 0.8 (default 0.9)

# For training: reduce batch size or enable gradient checkpointing
# torch.cuda.empty_cache() returns cached, unused blocks to the driver; it does not free memory held by live tensors

Ray Cluster Issues

# Check Ray head logs
kubectl logs <ray-head-pod> -c ray-head

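# Access Ray dashboard (KubeRay names the head service <cluster-name>-head-svc)
kubectl port-forward svc/gpu-cluster-head-svc 8265:8265

# Check cluster resources from inside the head pod
kubectl exec <ray-head-pod> -c ray-head -- ray status

# Check autoscaler logs (the sidecar exists only when enableInTreeAutoscaling is true)
kubectl logs <ray-head-pod> -c autoscaler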

Best Practices

GPU Resource Management

Right-size GPU requests — Profile workloads before choosing GPU allocation. Use nvidia-smi pmon to measure actual utilization.
Use MIG for multi-tenant clusters — Hardware isolation prevents noisy-neighbor issues.
Enable time-slicing for development — Allow multiple dev workloads to share GPUs; reserve dedicated GPUs for production inference.
Set GPU memory limits in vLLM — Use --gpu-memory-utilization 0.85-0.90 to leave headroom for CUDA context and spikes.
Monitor GPU utilization continuously — Target >70% utilization for production inference; investigate anything below 50%.

Scheduling

Use Kueue for admission control — Prevent cluster overcommit by queuing jobs that exceed available GPU capacity.
Use Volcano for distributed training — Gang scheduling prevents partial allocation waste.
Enable topology awareness — For multi-GPU training, prefer same-node placement to leverage NVLink.
Set preemption policies — Allow production inference to preempt batch training during capacity pressure.

Inference Optimization

Use vLLM for LLM serving: PagedAttention and continuous batching raise throughput; the vLLM authors report 2-4x over earlier serving systems at comparable latency.
Enable tensor parallelism: For models that exceed single-GPU memory, split the weights across GPUs with --tensor-parallel-size.
Quantize models: 4-bit AWQ or GPTQ shrinks weight memory to roughly a quarter of FP16, usually with minor quality loss.
Batch similar requests: Group requests with similar max token lengths for better batching efficiency.
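
The memory effect of quantization and the need for tensor parallelism can be estimated before deploying. This is a rough sizing sketch with hypothetical helper names; it counts weights only (no KV cache, activations, or quantization metadata), so the result is a lower bound, and the power-of-two search is an assumption reflecting common tensor-parallel sizes.

```python
BYTES_PER_PARAM = {"float16": 2.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params_billion, dtype):
    """Approximate weight memory in GB (10^9 bytes) for a given precision."""
    return num_params_billion * BYTES_PER_PARAM[dtype]


def min_tensor_parallel_size(weights_gb, gpu_memory_gb, utilization=0.9):
    """Smallest power-of-two GPU count whose combined budget holds the weights."""
    size = 1
    while weights_gb > size * gpu_memory_gb * utilization:
        size *= 2
    return size


if __name__ == "__main__":
    print(weight_memory_gb(7, "float16"))  # 14.0
    print(weight_memory_gb(7, "int4"))     # 3.5
    # A 70B-parameter model in FP16 on 80 GB GPUs.
    print(min_tensor_parallel_size(weight_memory_gb(70, "float16"), 80))
```

In practice the chosen `--tensor-parallel-size` is usually larger than this bound, since the KV cache needs its share of each GPU as well.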