AI Platform Engineering — Operations¶
GPU Node Setup¶
Prerequisites¶
Before Kubernetes can schedule GPU workloads, each GPU node requires:
- NVIDIA Driver — kernel module for GPU hardware access
- NVIDIA Container Toolkit — enables container runtimes to access GPUs
- NVIDIA Device Plugin — advertises GPUs to kubelet
- Optionally: NVIDIA GPU Operator — automates all of the above
Manual GPU Node Setup¶
# Verify GPU hardware is detected
lspci | grep -i nvidia
# Check NVIDIA driver installation
nvidia-smi
# Verify NVIDIA Container Toolkit
nvidia-container-cli info
# Deploy NVIDIA Device Plugin as DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
# Verify GPU resources are advertised
kubectl describe node <gpu-node> | grep nvidia.com/gpu
GPU Operator Deployment (Recommended)¶
The GPU Operator automates the full lifecycle:
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true \
--set migManager.enabled=true \
--set nodeStatusExporter.enabled=true
# Verify all components are running
kubectl get pods -n gpu-operator
# Check GPU resources on nodes
kubectl get nodes -o json | jq '.items[].status.allocatable | select(.["nvidia.com/gpu"])'
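The same check can be scripted when the result needs to be consumed by other tooling. The following is a minimal sketch that assumes the input has the shape of `kubectl get nodes -o json` output; the function name `gpu_allocatable` and the sample node names are hypothetical.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_allocatable(node_list: dict) -> dict:
    """Return {node name: allocatable GPU count} for nodes that advertise GPUs."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resource quantities as strings.
        count = int(allocatable.get(GPU_RESOURCE, 0))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Made-up sample shaped like `kubectl get nodes -o json` output.
SAMPLE_NODES = json.loads("""
{
  "items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "4"}}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"allocatable": {"cpu": "32"}}}
  ]
}
""")

print(gpu_allocatable(SAMPLE_NODES))
```

To run it against a live cluster, the sample could be replaced with `json.load(sys.stdin)` and the script fed from `kubectl get nodes -o json`.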
Verify GPU Scheduling¶
# Run a test GPU pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Check pod logs
kubectl logs gpu-test
MIG Configuration¶
Supported GPU Profiles¶
MIG is available on NVIDIA Ampere+ GPUs (A100, A30, H100). Common profiles for A100 80GB:
| Profile | GPU Memory | SMs | GPU Engines | Max Instances |
|---|---|---|---|---|
| 1g.10gb | 10 GB | 14 | 1 | 7 |
| 2g.20gb | 20 GB | 28 | 2 | 3 |
| 3g.40gb | 40 GB | 42 | 3 | 2 |
| 4g.40gb | 40 GB | 56 | 4 | 1 |
| 7g.80gb | 80 GB | 98 | 7 | 1 |
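A quick capacity check can catch impossible profile combinations before touching the GPU. The sketch below uses only the per-profile figures from the table above (GPU engines and memory) and checks that a requested mix stays within 7 engines and 80 GB. This is a necessary condition only: real MIG placement has additional slot-alignment rules that are not modelled here, and the function name `fits_capacity` is hypothetical.

```python
# (GPU engines, memory in GB) per profile, taken from the A100 80GB table.
PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}
TOTAL_ENGINES = 7
TOTAL_MEMORY_GB = 80


def fits_capacity(requested: list) -> bool:
    """True if the requested profiles do not exceed total engines or memory.

    Passing this check does not guarantee that the driver can place the
    instances; it only rules out combinations that can never fit.
    """
    engines = sum(PROFILES[name][0] for name in requested)
    memory = sum(PROFILES[name][1] for name in requested)
    return engines <= TOTAL_ENGINES and memory <= TOTAL_MEMORY_GB


print(fits_capacity(["1g.10gb", "1g.10gb", "2g.20gb"]))
print(fits_capacity(["3g.40gb", "3g.40gb", "1g.10gb"]))
```

The second combination stays within 7 engines but needs 90 GB, so the check rejects it.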
Enable and Configure MIG¶
# Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1
# Reboot or reset GPU
sudo nvidia-smi -i 0 --gpu-reset
# List available MIG profiles
nvidia-smi mig -i 0 -lgip
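# Create GPU instances and their default compute instances (-C)
sudo nvidia-smi mig -i 0 -cgi 19,19,14 -C
# Creates: 2x 1g.10gb (profile ID 19) + 1x 2g.20gb (profile ID 14) on an A100 80GB
# Verify MIG instances
nvidia-smi mig -i 0 -lgi
# List compute instances
nvidia-smi mig -i 0 -lci
# Destroy all MIG instances (compute instances first, then GPU instances)
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi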
MIG with Kubernetes¶
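When MIG is enabled and the NVIDIA Device Plugin runs with the mixed MIG strategy, it advertises each MIG profile as a separate resource type (with the single strategy, MIG instances appear as plain nvidia.com/gpu):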
# Node capacity with MIG
nvidia.com/mig-1g.10gb: 2
nvidia.com/mig-2g.20gb: 1
# Pod requesting a specific MIG profile
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: inference
    image: vllm/vllm-openai:latest
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
GPU Time-Slicing Configuration¶
Enable Run:ai Time-Slicing¶
# Using Helm values
cat <<EOF > values-timeslicing.yaml
clusterConfig:
  global:
    core:
      timeSlicing:
        mode: fair # or "strict"
EOF
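# Apply the values to the Run:ai cluster release, not the NVIDIA GPU Operator release.
# Assumes Run:ai was installed as release "runai-cluster" from chart runai/runai-cluster.
helm upgrade runai-cluster runai/runai-cluster \
  -f values-timeslicing.yaml \
  --namespace runai \
  --reuse-values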
# Using kubectl patch (runtime)
kubectl patch -n runai runaiconfigs.run.ai/runai \
--type='merge' \
--patch '{"spec":{"global":{"core":{"timeSlicing":{"mode": "fair"}}}}}'
NVIDIA Native Time-Slicing (without Run:ai)¶
# ConfigMap for NVIDIA Device Plugin time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4 # Allow 4 pods per GPU
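The effect of `replicas` on node capacity is simple multiplication, which the sketch below makes explicit. The helper names are hypothetical. Note that time-slicing shares compute time only and does not partition GPU memory, so the per-pod memory figure is a planning estimate, not an enforced limit.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """nvidia.com/gpu count a node advertises once time-slicing is applied."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def naive_memory_share_gib(gpu_memory_gib: float, replicas: int) -> float:
    """Planning figure only: time-slicing does not enforce any memory split."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return gpu_memory_gib / replicas


# An 8-GPU node with replicas: 4 advertises 32 schedulable GPUs.
print(advertised_gpus(physical_gpus=8, replicas=4))
# If four pods share one 80 GiB GPU evenly, each should plan for about 20 GiB.
print(naive_memory_share_gib(gpu_memory_gib=80, replicas=4))
```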
vLLM Deployment¶
Standalone vLLM Server¶
# Install vLLM
pip install vllm
# Start vLLM server with a model
vllm serve meta-llama/Llama-2-7b-chat-hf \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--dtype auto
# Test with curl (OpenAI-compatible API)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 128
}'
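The same request can be issued from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_chat_request` and `chat` are hypothetical, and the response is assumed to follow the OpenAI chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str,
         max_tokens: int = 128, timeout: float = 60.0) -> str:
    """Send the request and return the first completion's text."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, with a vLLM server listening on localhost:8000:
#   print(chat("http://localhost:8000", "meta-llama/Llama-2-7b-chat-hf", "Hello!"))
```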
vLLM on Kubernetes¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-2-7b-chat-hf"
        - "--tensor-parallel-size"
        - "1"
        - "--gpu-memory-utilization"
        - "0.9"
        - "--max-model-len"
        - "4096"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 32Gi
          requests:
            memory: 16Gi
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
vLLM Key Parameters¶
| Parameter | Default | Purpose |
|---|---|---|
| `--tensor-parallel-size` | 1 | Number of GPUs for tensor parallelism |
| `--gpu-memory-utilization` | 0.9 | Fraction of GPU memory the vLLM instance may use (weights, activations, and KV cache) |
| `--max-model-len` | Model default | Maximum sequence length (prompt plus generated tokens) |
| `--dtype` | auto | Weight and activation precision (float16, bfloat16, auto) |
| `--enforce-eager` | false | Disable CUDA graphs and run in eager mode (useful for debugging) |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--quantization` | none | Quantization method (awq, gptq, squeezellm) |
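To see how `--gpu-memory-utilization` and `--max-model-len` interact, it helps to estimate the KV cache footprint. The sketch below uses the standard per-token formula (one key and one value vector per layer) with the Llama-2-7B architecture (32 layers, 32 KV heads, head dimension 128). The 13 GiB weight size is an approximation for FP16, and the estimate ignores activation memory and allocator overhead, so treat the result as an upper bound rather than what vLLM will report.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_budget(gpu_mem_gib: float, gpu_memory_utilization: float,
                          weights_gib: float, bytes_per_token: int) -> int:
    """Rough upper bound on tokens that fit in the KV cache."""
    usable_gib = gpu_mem_gib * gpu_memory_utilization - weights_gib
    if usable_gib <= 0:
        return 0
    return int(usable_gib * 1024**3 // bytes_per_token)


# Llama-2-7B in FP16: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# Assumed: 80 GiB GPU, utilization 0.9, roughly 13 GiB of FP16 weights.
budget = kv_cache_token_budget(80, 0.9, 13, per_token)
print(budget, "tokens, or about", budget // 4096, "sequences of 4096 tokens")
```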
Ray Cluster Deployment¶
Ray on Kubernetes (KubeRay)¶
# Install KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator \
--namespace ray-system \
--create-namespace
# Deploy a Ray cluster
cat <<EOF | kubectl apply -f -
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: gpu-cluster
spec:
  # Needed for minReplicas/maxReplicas to take effect; adds an autoscaler sidecar to the head pod
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          # Recent Ray releases do not publish rayproject/ray-ml images;
          # use rayproject/ray and add ML libraries as needed
          image: rayproject/ray:2.46.0-py311-gpu
          ports:
          - containerPort: 6379
          - containerPort: 8265 # Dashboard
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
  workerGroupSpecs:
  - replicas: 2
    minReplicas: 1
    maxReplicas: 4
    groupName: gpu-workers
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0-py311-gpu
          resources:
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: 1
EOF
# Check Ray cluster status
kubectl get rayclusters
kubectl get pods -l ray.io/cluster=gpu-cluster
Ray Job Submission¶
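# Point the Ray job CLI at the head node's dashboard.
# KubeRay names the head service <cluster-name>-head-svc.
export RAY_ADDRESS=http://gpu-cluster-head-svc:8265
# Submit a Ray job
ray job submit \
  --working-dir . \
  -- python train.py
# Check job status
ray job status <job-id>
# View job logs
ray job logs <job-id>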
Basic Ray Task Example¶
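import ray

ray.init()

@ray.remote(num_gpus=1)
def train_on_chunk(data_chunk):
    # Import inside the task so only GPU workers need torch installed
    import torch
    # Ray restricts CUDA_VISIBLE_DEVICES to the GPU reserved for this task
    device = torch.device("cuda")
    # Process data chunk on GPU (placeholder for real work)
    return len(data_chunk)

# Placeholder dataset; replace with real training data
data = list(range(10_000))
# Distribute work across GPUs, 1,000 items per task
chunks = [data[i:i+1000] for i in range(0, len(data), 1000)]
futures = [train_on_chunk.remote(chunk) for chunk in chunks]
results = ray.get(futures)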
Batch Scheduling with Volcano¶
Install Volcano¶
# Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
# Verify installation
kubectl get pods -n volcano-system
Gang-Scheduled Training Job¶
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
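  name: distributed-training
spec:
  minAvailable: 4 # Gang scheduling: all 4 workers required
  schedulerName: volcano
  queue: default
  plugins:
    env: [] # Injects VC_TASK_INDEX into each pod
    svc: [] # Creates a headless service so workers can resolve each other
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 4
    name: worker
    template:
      spec:
        containers:
        - name: pytorch-worker
          image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
          command: ["torchrun"]
          args:
          - "--nproc_per_node=1"
          - "--nnodes=4"
          - "--node_rank=$(VC_TASK_INDEX)"
          # Assumed rendezvous endpoint: worker 0, reached through the svc plugin's headless service
          - "--master_addr=distributed-training-worker-0.distributed-training"
          - "--master_port=29500"
          - "train.py"
          resources:
            limits:
              nvidia.com/gpu: 1
        restartPolicy: OnFailure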
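The all-or-nothing behaviour behind `minAvailable` can be illustrated with a toy admission check. This is a simplified sketch of the gang-scheduling idea, not Volcano's actual algorithm: it only counts how many pods would fit on the currently free GPUs and starts the job if that number reaches `minAvailable`. The function name `gang_admit` is hypothetical.

```python
def gang_admit(free_gpus_per_node: list, min_available: int,
               gpus_per_pod: int = 1) -> bool:
    """Start the job only if at least min_available pods can be placed at once."""
    if gpus_per_pod < 1:
        raise ValueError("gpus_per_pod must be >= 1")
    # A pod cannot span nodes, so count whole pods per node.
    placeable_pods = sum(free // gpus_per_pod for free in free_gpus_per_node)
    return placeable_pods >= min_available


# Three free GPUs: a 4-worker gang waits instead of starting 3 workers
# that would hold GPUs while blocked on the missing fourth.
print(gang_admit([1, 1, 1], min_available=4))
# Four free GPUs across three nodes: the whole gang starts together.
print(gang_admit([2, 1, 1], min_available=4))
```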
Job Queueing with Kueue¶
Install Kueue¶
# Install Kueue
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.11.0/manifests.yaml
# Verify
kubectl get pods -n kueue-system
Configure Resource Quotas¶
# ResourceFlavor referenced by the ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
---
# ClusterQueue with GPU quota
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {} # Match all namespaces
  resourceGroups:
  # Listed in the same order as the flavor's resources
  - coveredResources: ["nvidia.com/gpu", "cpu", "memory"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
---
# LocalQueue for team namespace
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-ml-queue
  namespace: ml-team
spec:
  clusterQueue: gpu-cluster-queue
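The quota logic can be sketched as a per-resource comparison of current usage plus the new request against the nominal quota. This is a simplified illustration using the quota values from the ClusterQueue above; it ignores Kueue features such as cohorts, borrowing, and preemption, and the names `NOMINAL_QUOTA` and `admits` are hypothetical.

```python
# Nominal quotas from the ClusterQueue above (memory expressed in GiB).
NOMINAL_QUOTA = {"nvidia.com/gpu": 8, "cpu": 64, "memory": 256}


def admits(usage: dict, request: dict, quota: dict = NOMINAL_QUOTA) -> bool:
    """True if usage + request stays within quota for every covered resource."""
    return all(
        usage.get(resource, 0) + request.get(resource, 0) <= limit
        for resource, limit in quota.items()
    )


current = {"nvidia.com/gpu": 6, "cpu": 24, "memory": 96}

# Fits: 6 + 2 GPUs equals the quota of 8.
print(admits(current, {"nvidia.com/gpu": 2, "cpu": 8, "memory": 32}))
# Queued: 6 + 4 GPUs exceeds the quota of 8.
print(admits(current, {"nvidia.com/gpu": 4, "cpu": 8, "memory": 32}))
```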
GPU Monitoring¶
DCGM Exporter Metrics¶
The DCGM (Data Center GPU Manager) Exporter runs as part of the GPU Operator and exposes Prometheus metrics:
# Key GPU metrics
DCGM_FI_DEV_GPU_UTIL # GPU utilization (%)
DCGM_FI_DEV_MEM_COPY_UTIL # Memory bandwidth utilization (%)
DCGM_FI_DEV_FB_USED # Framebuffer memory used (MiB)
DCGM_FI_DEV_FB_FREE # Framebuffer memory free (MiB)
DCGM_FI_DEV_GPU_TEMP # GPU temperature (°C)
DCGM_FI_DEV_POWER_USAGE # Power consumption (W)
DCGM_FI_DEV_SM_CLOCK # SM clock frequency (MHz)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL # NVLink bandwidth
DCGM_FI_DEV_PCIE_REPLAY_COUNTER # PCIe replay errors
Essential Monitoring Queries (PromQL)¶
# GPU utilization across cluster
avg(DCGM_FI_DEV_GPU_UTIL) by (gpu, Hostname)
# GPU memory usage percentage
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100
# Idle GPUs (utilization stayed below 5% for the last 10 minutes)
max_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) < 5
# GPU temperature alerts
DCGM_FI_DEV_GPU_TEMP > 85
# Pods requesting GPUs (metric exposed by kube-state-metrics)
kube_pod_container_resource_limits{resource="nvidia_com_gpu"} > 0
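These queries can also be run programmatically through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below only builds the request URL and mirrors the memory-percentage formula in plain Python; the base URL `http://prometheus:9090` and the helper names are assumptions.

```python
from urllib.parse import urlencode


def instant_query_url(prometheus_base: str, promql: str) -> str:
    """URL for a Prometheus instant query with the expression URL-encoded."""
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def memory_used_percent(fb_used_mib: float, fb_free_mib: float) -> float:
    """Same formula as the PromQL expression: used / (used + free) * 100."""
    total = fb_used_mib + fb_free_mib
    if total <= 0:
        raise ValueError("used + free must be positive")
    return fb_used_mib / total * 100


print(instant_query_url("http://prometheus:9090", "DCGM_FI_DEV_GPU_TEMP > 85"))
print(memory_used_percent(fb_used_mib=20480, fb_free_mib=61440))
```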
nvidia-smi Quick Reference¶
# Full GPU status
nvidia-smi
# Continuous monitoring (refresh every 1 second)
nvidia-smi -l 1
# Query specific metrics
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv
# Show running GPU processes
nvidia-smi pmon -s um -d 1
# Check MIG status
nvidia-smi mig -lgi -i 0
# Show NVLink status
nvidia-smi nvlink -s
# Show GPU topology
nvidia-smi topo -m
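For scripting, the query form is easiest to consume with `--format=csv,noheader,nounits`, which drops the header row and unit suffixes. The sketch below parses such output into dictionaries. The sample text is made up for illustration, and the parser assumes every numeric field is reported (fields shown as `[N/A]` on some GPUs would need extra handling).

```python
import csv
import io

# Field order matches the --query-gpu list used in the quick reference.
FIELDS = ["name", "temperature.gpu", "utilization.gpu",
          "utilization.memory", "memory.used", "memory.total"]


def parse_gpu_csv(text: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        row = dict(zip(FIELDS, (value.strip() for value in raw)))
        for key in FIELDS[1:]:
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Made-up sample output for two GPUs.
SAMPLE = (
    "NVIDIA A100-SXM4-80GB, 41, 87, 52, 61234, 81920\n"
    "NVIDIA A100-SXM4-80GB, 35, 0, 0, 3, 81920\n"
)

gpus = parse_gpu_csv(SAMPLE)
idle = [index for index, gpu in enumerate(gpus) if gpu["utilization.gpu"] < 5]
print("idle GPU indices:", idle)
```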
Troubleshooting¶
GPU Not Visible to Kubernetes¶
# Check if driver is loaded
lsmod | grep nvidia
# Check device plugin pods
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
# Verify extended resources on node
kubectl describe node <node> | grep -A5 "Allocatable"
Pod Stuck in Pending (GPU)¶
# Check if GPU resources are available
kubectl describe node <node> | grep nvidia.com/gpu
# Check events on the pending pod
kubectl describe pod <pod-name> | tail -20
# Common causes:
# - No GPU nodes available
# - All GPUs allocated to other pods
# - Resource request exceeds node capacity
# - Taints/tolerations preventing scheduling
GPU Out of Memory (OOM)¶
# Check GPU memory usage
nvidia-smi
# For vLLM: reduce memory utilization
# --gpu-memory-utilization 0.8 (default 0.9)
# For training: reduce batch size or enable gradient checkpointing
# torch.cuda.empty_cache() releases cached, unused blocks; it does not free memory held by live tensors
Ray Cluster Issues¶
# Check Ray head logs
kubectl logs <ray-head-pod> -c ray-head
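# Access Ray dashboard (KubeRay names the head service <cluster-name>-head-svc)
kubectl port-forward svc/gpu-cluster-head-svc 8265:8265
# Check cluster resources from inside the head pod
kubectl exec -it <ray-head-pod> -c ray-head -- ray status
# Check autoscaler logs (the sidecar exists only when enableInTreeAutoscaling is true)
kubectl logs <ray-head-pod> -c autoscaler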
Best Practices¶
GPU Resource Management¶
- Right-size GPU requests: Profile workloads before choosing GPU allocation. Use `nvidia-smi pmon` to measure actual utilization.
- Use MIG for multi-tenant clusters: Hardware isolation prevents noisy-neighbor issues.
- Enable time-slicing for development: Allow multiple dev workloads to share GPUs; reserve dedicated GPUs for production inference.
- Set GPU memory limits in vLLM: Use `--gpu-memory-utilization` between 0.85 and 0.90 to leave headroom for CUDA context and spikes.
- Monitor GPU utilization continuously: Target >70% utilization for production inference; investigate anything below 50%.
Scheduling¶
- Use Kueue for admission control — Prevent cluster overcommit by queuing jobs that exceed available GPU capacity.
- Use Volcano for distributed training — Gang scheduling prevents partial allocation waste.
- Enable topology awareness — For multi-GPU training, prefer same-node placement to leverage NVLink.
- Set preemption policies — Allow production inference to preempt batch training during capacity pressure.
Inference Optimization¶
- Use vLLM for LLM serving: PagedAttention and continuous batching raise throughput; the vLLM authors report 2-4x higher throughput than earlier serving systems such as FasterTransformer and Orca at comparable latency.
- Enable tensor parallelism: For models that exceed single-GPU memory, split across GPUs with `--tensor-parallel-size`.
- Quantize models: 4-bit AWQ or GPTQ quantization shrinks model weights to roughly a quarter of their FP16 size (the KV cache is unchanged) with minimal quality loss.
- Batch similar requests: Group requests with similar max token lengths for better batching efficiency.
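Grouping requests by length can be done on the client side before they are sent. The sketch below buckets requests by their `max_tokens` value; the bucket boundaries (128, 512, 2048) and the function name are illustrative assumptions, and requests above the largest boundary land in a `None` bucket.

```python
from collections import defaultdict


def bucket_by_max_tokens(requests: list, boundaries: tuple = (128, 512, 2048)) -> dict:
    """Group requests under the smallest boundary that covers their max_tokens."""
    buckets = defaultdict(list)
    for request in requests:
        limit = next((b for b in sorted(boundaries) if request["max_tokens"] <= b), None)
        buckets[limit].append(request)
    return dict(buckets)


pending = [
    {"id": "a", "max_tokens": 64},
    {"id": "b", "max_tokens": 400},
    {"id": "c", "max_tokens": 100},
    {"id": "d", "max_tokens": 4096},
]

for limit, group in bucket_by_max_tokens(pending).items():
    print(limit, [request["id"] for request in group])
```

Each bucket can then be submitted as its own batch, so short generations are not held back by long ones.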