AI Platform Engineering — Architecture¶
GPU Discovery in Kubernetes¶
Kubernetes does not natively understand GPU hardware. A multi-layer discovery path makes GPUs visible as schedulable resources:
GPU Hardware
|
NVIDIA Driver (OS-level)
|
Device Plugin (kubelet integration)
|
Kubernetes Node (resource advertised)
|
Pod Requests GPU
Step 1: Physical GPU + Driver¶
The physical GPU is attached to a worker node. The operating system exposes it through vendor-specific drivers. For NVIDIA GPUs, the NVIDIA driver must be installed on the node before the OS can interact with the hardware.
Step 2: Kubernetes Awareness Gap¶
Even with the driver installed, Kubernetes remains unaware of the GPU. Kubernetes natively understands three resource types:
- CPU (
cpu) - Memory (
memory) - Ephemeral storage (
ephemeral-storage)
GPUs must be explicitly registered through the Device Plugin API.
Step 3: Device Plugin Registration¶
A Device Plugin is a Kubernetes extension that advertises specialized hardware to the kubelet. The NVIDIA Device Plugin runs as a DaemonSet on each GPU node and reports available GPUs.
Once registered, Kubernetes sees the resource:
Key Insight
Kubernetes does not schedule GPUs because it understands GPUs. It schedules GPUs because a Device Plugin exposes them as generic extended resources. The same mechanism works for TPUs, FPGAs, SmartNICs, and any other accelerator with a Device Plugin implementation.
Step 4: Pod GPU Requests¶
After registration, workloads request GPUs identically to CPU and memory:
apiVersion: v1
kind: Pod
metadata:
name: gpu-inference
spec:
containers:
- name: model-server
image: vllm/vllm-openai:latest
resources:
limits:
nvidia.com/gpu: 1
The scheduler matches resource requests with available resources — it has no knowledge of whether the workload is running an LLM, image generator, or training job.
Device Plugin Architecture¶
sequenceDiagram
participant GPU as GPU Hardware
participant Driver as NVIDIA Driver
participant DP as NVIDIA Device Plugin
participant Kubelet as kubelet
participant Scheduler as kube-scheduler
participant Pod as Pod
GPU->>Driver: Hardware attached
Driver->>DP: GPU devices available
DP->>Kubelet: Register via gRPC<br/>ListAndWatch(nvidia.com/gpu: 4)
Kubelet->>Scheduler: Node capacity updated
Pod->>Scheduler: Request nvidia.com/gpu: 1
Scheduler->>Kubelet: Schedule Pod on GPU node
Kubelet->>DP: Allocate(deviceID)
DP->>Pod: Mount GPU device + env vars
The Device Plugin communicates with the kubelet via a gRPC interface at /var/lib/kubelet/device-plugins/. The plugin must handle kubelet restarts by monitoring socket deletion and re-registering. The Device Plugin API supports:
ListAndWatch— advertises available devices and reports health changesAllocate— provisions device access for containers (device nodes, environment variables, mounts, CDI device names)- Health monitoring — marks devices as unhealthy when failures are detected, reducing the node's allocatable count
Since Kubernetes v1.36 (beta), allocatedResourcesStatus in pod status reports per-device health information including error details and failure reasons.
NVIDIA GPU Operator¶
The NVIDIA GPU Operator automates the full GPU lifecycle on Kubernetes nodes:
| Component | Purpose |
|---|---|
| NVIDIA Driver | Kernel-level GPU access |
| NVIDIA Device Plugin | GPU advertisement to Kubernetes |
| NVIDIA Container Toolkit | Container runtime GPU integration |
| DCGM Exporter | GPU metrics for Prometheus |
| Node Feature Discovery | Labels nodes with GPU properties |
| MIG Manager | Multi-Instance GPU partition management |
| GPU Feature Discovery | Exposes GPU model, memory, driver version as node labels |
This eliminates manual driver installation, device plugin deployment, and monitoring setup across the cluster.
GPU Utilization and Resource Fragmentation¶
The Core Problem¶
Standard Kubernetes GPU allocation is binary — a pod requests one or more whole GPUs, and each GPU is allocated exclusively:
Pod A → 1 GPU Requested → 1 GPU Allocated (80 GB)
Actual usage: 10 GB memory, 20% compute
Waste: 70 GB memory, 80% compute
With CPUs, Kubernetes efficiently bin-packs multiple pods onto a single node. GPUs traditionally do not support this — one pod per GPU, regardless of actual utilization.
Cost Impact¶
Consider a cluster with 8x NVIDIA A100 GPUs where every workload uses only 25% of each GPU:
- Effective utilization: 2 GPUs worth of useful work
- Paid for: 8 GPUs
- Waste: 75% of GPU investment
At $30,000+ per A100 GPU, this translates to ~$180,000 in wasted capacity.
GPU Sharing Strategies¶
Three primary techniques address GPU underutilization:
Time-Slicing¶
Multiple workloads take turns using the same GPU, analogous to CPU time-sharing. NVIDIA's Run:ai implementation provides two modes:
| Mode | Behavior | K8s Mapping |
|---|---|---|
| Strict | Each workload gets exactly its requested GPU compute fraction | gpu-compute-request = gpu-compute-limit = gpu-fraction |
| Fair | Each workload gets at least its fraction, plus unused slices from idle workloads | gpu-compute-request = gpu-fraction, gpu-compute-limit = 1.0 |
Time-slicing operates on a plan/lease cycle. Default configuration:
- Lease time: 250ms (exclusive GPU access per workload)
- Granularity: 5% precision
- Plan (cycle) time: 250ms / 0.05 = 5000ms (5 seconds)
A workload requesting gpu-fraction=0.5 gets 2.5s of runtime per 5s cycle.
Trade-offs
Decreasing lease time makes time-slicing less accurate. Increasing lease time improves accuracy but reduces workload responsiveness. Context switching between workloads adds overhead.
Multi-Instance GPU (MIG)¶
Available on NVIDIA Ampere+ GPUs (A100, H100). MIG partitions a single GPU into up to 7 isolated GPU Instances, each with dedicated:
- Streaming Multiprocessors (SMs)
- GPU engines (copy engines, decoders)
- L2 cache banks
- Memory controllers
- DRAM address busses
Example partitioning of an 80 GB A100:
Full A100 (80 GB)
├── MIG Instance 1: 10 GB (1g.10gb)
├── MIG Instance 2: 10 GB (1g.10gb)
├── MIG Instance 3: 20 GB (2g.20gb)
└── MIG Instance 4: 40 GB (4g.40gb)
Each instance provides hardware-level isolation — one workload cannot impact the L2 cache or DRAM bandwidth of another. This makes MIG suitable for multi-tenant environments where QoS guarantees are required.
MIG supports:
- Bare-metal and containers
- GPU passthrough virtualization
- vGPU on supported hypervisors
Continuous Batching (Inference)¶
Instead of processing inference requests one-by-one, the serving engine combines multiple requests into a single GPU execution. vLLM's continuous batching dynamically adds new requests as older ones complete, keeping the GPU busy continuously rather than waiting for full batch formation.
AI Workload Scheduling¶
Why Standard Kubernetes Scheduling Falls Short¶
Traditional applications are loosely coupled — components can start independently and tolerate staggered scheduling. AI training jobs have fundamentally different requirements.
Gang Scheduling¶
Distributed training jobs require all resources allocated simultaneously:
Training Job (requires 8 GPUs)
├── Worker 0: GPU 0
├── Worker 1: GPU 1
├── Worker 2: GPU 2
├── Worker 3: GPU 3
├── Worker 4: GPU 4
├── Worker 5: GPU 5
├── Worker 6: GPU 6
└── Worker 7: GPU 7
If only 6 GPUs are available, the job cannot start. Partial allocation wastes resources — workers wait indefinitely for the remaining GPUs, blocking other jobs.
Gang scheduling rule: Either schedule all required resources together, or schedule none of them.
Topology Awareness¶
GPU placement affects training performance significantly. Same-node GPUs communicate via high-speed interconnects (NVLink at 900 GB/s on H100), while cross-node GPUs use network fabric (InfiniBand at ~400 Gb/s).
graph LR
subgraph "Node A (Fast: NVLink)"
GPU0["GPU 0"]
GPU1["GPU 1"]
GPU2["GPU 2"]
GPU3["GPU 3"]
GPU0 <--> GPU1
GPU1 <--> GPU2
GPU2 <--> GPU3
end
subgraph "Node B (Fast: NVLink)"
GPU4["GPU 4"]
GPU5["GPU 5"]
GPU6["GPU 6"]
GPU7["GPU 7"]
GPU4 <--> GPU5
GPU5 <--> GPU6
GPU6 <--> GPU7
end
GPU3 <-.->|"Slower: Network Fabric"| GPU4
A topology-aware scheduler prefers placing all GPUs on the same node when possible, falling back to nodes with the best inter-node connectivity.
Scheduling Tools Comparison¶
| Tool | Type | Key Capabilities |
|---|---|---|
| Volcano | Kubernetes-native batch scheduler | Gang scheduling, queue management, fair-share, priority-based preemption |
| Kueue | Kubernetes SIG job queueing | Admission control, resource quotas, job queuing, cluster queue management |
| Kubeflow Training Operator | Distributed training CRD | PyTorchJob, TFJob, XGBoostJob — works with Volcano for gang scheduling |
| NVIDIA GPU Operator | GPU lifecycle manager | Driver management, device plugins, DCGM metrics, MIG management |
Ray — Distributed Compute Framework¶
The Two-Layer Model¶
graph TB
subgraph "Application Layer"
User["User / Application"]
RayDriver["Ray Driver"]
end
subgraph "Ray Layer (Computation Scheduling)"
RayHead["Ray Head Node<br/>(GCS, Autoscaler, Dashboard)"]
RayWorker1["Ray Worker 1"]
RayWorker2["Ray Worker 2"]
RayWorker3["Ray Worker 3"]
end
subgraph "Kubernetes Layer (Infrastructure Scheduling)"
K8s["kube-scheduler"]
Pod1["Pod (Head)"]
Pod2["Pod (Worker)"]
Pod3["Pod (Worker)"]
Pod4["Pod (Worker)"]
end
User --> RayDriver
RayDriver --> RayHead
RayHead --> RayWorker1
RayHead --> RayWorker2
RayHead --> RayWorker3
K8s --> Pod1
K8s --> Pod2
K8s --> Pod3
K8s --> Pod4
Pod1 -.- RayHead
Pod2 -.- RayWorker1
Pod3 -.- RayWorker2
Pod4 -.- RayWorker3
Kubernetes schedules infrastructure (which node should this pod run on?). Ray schedules computation (which worker executes which task? how are results collected?).
Ray Architecture¶
| Component | Role |
|---|---|
| Head Node | Runs GCS (Global Control Service), autoscaler, Ray dashboard. Also schedules tasks like worker nodes. |
| Worker Nodes | Execute Ray tasks and actors. Participate in distributed object storage. |
| Autoscaler | Scales worker nodes based on task/actor resource requests (not CPU/memory metrics). |
| GCS | Central metadata store for cluster state, actor locations, and resource availability. |
Tasks vs Actors¶
| Dimension | Tasks | Actors |
|---|---|---|
| State | Stateless | Stateful |
| Lifecycle | Run once, return result | Long-lived, handle multiple requests |
| Use case | Data processing, hyperparameter search | Model serving, stateful computation |
| Invocation | function.remote(args) |
actor.method.remote(args) |
Tasks enable embarrassingly parallel workloads (data processing, hyperparameter tuning). Actors enable stateful services (model serving, game environments, RL training).
Ray Libraries¶
| Library | Purpose |
|---|---|
| Ray Train | Distributed training (PyTorch, TensorFlow, XGBoost) |
| Ray Tune | Hyperparameter optimization |
| Ray Serve | Model serving and composition |
| Ray Data | Distributed data processing |
| Ray RLlib | Reinforcement learning |
Ray on Kubernetes (KubeRay)¶
KubeRay provides Kubernetes CRDs for managing Ray clusters:
RayCluster— manages head and worker podsRayJob— submits jobs to a Ray clusterRayService— manages Ray Serve deployments with zero-downtime upgrades
The autoscaler in KubeRay v2 runs as a sidecar container in the head pod, scaling worker pods based on pending Ray task/actor resource demands.
vLLM — Inference Engine Architecture¶
Why Naive Model Serving Fails at Scale¶
A naive inference server processes requests sequentially or in static batches:
Request 1 → GPU → Response 1
Request 2 → GPU → Response 2 (waits for Request 1)
Request 3 → GPU → Response 3 (waits for Request 2)
With 100 concurrent users, GPU utilization stays low because the GPU cannot exploit its parallel architecture. This is the GPU utilization problem from Day 3 applied to inference serving.
PagedAttention¶
The key innovation in vLLM. Traditional serving systems allocate GPU memory for the KV cache in large, contiguous chunks. For variable-length sequences, this causes:
- Internal fragmentation — allocated blocks larger than needed
- External fragmentation — free memory scattered in unusable small chunks
- Reservation waste — memory reserved for maximum sequence length even for short sequences
PagedAttention treats the KV cache like virtual memory — memory is managed in fixed-size blocks (pages) that need not be contiguous:
Traditional KV Cache:
[████████████░░░░░░░░] Request 1 (wasted space)
[████████░░░░░░░░░░░░] Request 2 (wasted space)
[░░░░░░░░░░░░░░░░░░░░] Free (fragmented)
PagedAttention:
[████][████][████][██] Request 1 (pages, no waste)
[████][████][██] Request 2 (pages, no waste)
[████][████] Free (reusable pages)
Results:
- Near-zero memory waste
- 2-4x more concurrent requests with the same GPU
- Dynamic memory allocation as sequences grow
Continuous Batching¶
Traditional batching waits for a full batch before processing:
Traditional: Wait → Process Batch → Wait → Process Batch
Continuous: Process ─── Process ─── Process ─── Process
(new requests added as old ones complete)
vLLM's continuous batching adds new requests to the running batch as existing requests finish generating tokens. The GPU stays busy continuously, dramatically improving throughput.
vLLM in the Platform Stack¶
Users
|
vLLM (inference optimization: PagedAttention + continuous batching)
|
Model Weights (loaded into GPU memory)
|
GPU (compute)
In a Kubernetes deployment:
Users
|
Load Balancer / Gateway
|
vLLM Pods (KServe InferenceService or Ray Serve)
|
Kubernetes (scheduling, scaling, health checks)
|
GPU Nodes (NVIDIA Device Plugin, GPU Operator)
vLLM exposes an OpenAI-compatible API, making it a drop-in replacement for existing applications that use the OpenAI API format.
Full Platform Architecture Diagram¶
graph TB
subgraph "Layer 1: AI Applications"
App1["Virtual Assistants"]
App2["Recommendation Systems"]
App3["Fraud Detection"]
App4["Content Generation"]
end
subgraph "Layer 2: AI Platform Services"
subgraph "Data Pipeline"
DP["Data Processing"]
FE["Feature Engineering"]
end
subgraph "Model Lifecycle"
MT["Model Training<br/>(Kubeflow, Ray Train)"]
MR["Model Registry<br/>(MLflow)"]
end
subgraph "Serving & Monitoring"
MS["Model Serving<br/>(KServe, vLLM, Ray Serve)"]
MON["Monitoring<br/>(Prometheus, DCGM)"]
end
end
subgraph "Layer 3: Infrastructure"
subgraph "Orchestration"
K8S["Kubernetes"]
SCHED["Schedulers<br/>(Volcano, Kueue)"]
RAY["Ray Cluster"]
end
subgraph "Compute & Storage"
GPU["GPUs<br/>(A100, H100)"]
CPU["CPUs"]
STORE["Object Storage<br/>(S3, MinIO)"]
end
subgraph "GPU Management"
GPUOP["GPU Operator"]
DEVPLUGIN["Device Plugin"]
MIG["MIG Manager"]
end
end
App1 & App2 & App3 & App4 --> MS
DP --> FE --> MT --> MR --> MS
MS --> MON
MS --> RAY
MT --> RAY
RAY --> K8S
SCHED --> K8S
K8S --> GPU & CPU & STORE
GPUOP --> DEVPLUGIN --> GPU
GPUOP --> MIG --> GPU
Benchmarks and Scale Considerations¶
vLLM Performance Characteristics¶
Based on the PagedAttention paper (arXiv:2309.06180):
- PagedAttention achieves near-zero KV cache waste vs 60-80% waste in naive allocators
- Continuous batching can improve throughput by 2-4x over static batching
- Memory savings translate directly to higher concurrent request capacity
GPU Memory Budget (Inference)¶
For an LLM with P parameters at B bytes per parameter:
Model weights: P × B bytes
KV cache: Variable (grows with context length × batch size)
Overhead: ~10-20% for framework, CUDA context
Example: Llama 2 70B at FP16:
- Model weights: 70B × 2 bytes = 140 GB
- Minimum: 2x A100 80GB (tensor parallelism)
- With KV cache headroom: 4x A100 80GB for production batch sizes
GPU Interconnect Bandwidth¶
| Interconnect | Bandwidth | Use Case |
|---|---|---|
| NVLink (H100) | 900 GB/s | Intra-node GPU communication |
| NVLink (A100) | 600 GB/s | Intra-node GPU communication |
| InfiniBand HDR | 200 Gb/s (25 GB/s) | Inter-node communication |
| InfiniBand NDR | 400 Gb/s (50 GB/s) | Inter-node communication |
| Ethernet (RoCE) | 100-400 Gb/s | Cost-effective inter-node |
Topology-aware scheduling becomes critical when inter-node bandwidth is 10-30x lower than intra-node NVLink bandwidth.