Skip to content

AI Platform Engineering — Architecture

GPU Discovery in Kubernetes

Kubernetes does not natively understand GPU hardware. A multi-layer discovery path makes GPUs visible as schedulable resources:

GPU Hardware
      |
NVIDIA Driver (OS-level)
      |
Device Plugin (kubelet integration)
      |
Kubernetes Node (resource advertised)
      |
Pod Requests GPU

Step 1: Physical GPU + Driver

The physical GPU is attached to a worker node. The operating system exposes it through vendor-specific drivers. For NVIDIA GPUs, the NVIDIA driver must be installed on the node before the OS can interact with the hardware.

Step 2: Kubernetes Awareness Gap

Even with the driver installed, Kubernetes remains unaware of the GPU. Kubernetes natively understands three resource types:

  • CPU (cpu)
  • Memory (memory)
  • Ephemeral storage (ephemeral-storage)

GPUs must be explicitly registered through the Device Plugin API.

Step 3: Device Plugin Registration

A Device Plugin is a Kubernetes extension that advertises specialized hardware to the kubelet. The NVIDIA Device Plugin runs as a DaemonSet on each GPU node and reports available GPUs.

Once registered, Kubernetes sees the resource:

# Node capacity after device plugin registration
nvidia.com/gpu: 4

Key Insight

Kubernetes does not schedule GPUs because it understands GPUs. It schedules GPUs because a Device Plugin exposes them as generic extended resources. The same mechanism works for TPUs, FPGAs, SmartNICs, and any other accelerator with a Device Plugin implementation.

Step 4: Pod GPU Requests

After registration, workloads request GPUs identically to CPU and memory:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  containers:
    - name: model-server
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1

The scheduler matches resource requests with available resources — it has no knowledge of whether the workload is running an LLM, image generator, or training job.

Device Plugin Architecture

sequenceDiagram
    participant GPU as GPU Hardware
    participant Driver as NVIDIA Driver
    participant DP as NVIDIA Device Plugin
    participant Kubelet as kubelet
    participant Scheduler as kube-scheduler
    participant Pod as Pod

    GPU->>Driver: Hardware attached
    Driver->>DP: GPU devices available
    DP->>Kubelet: Register via gRPC<br/>ListAndWatch(nvidia.com/gpu: 4)
    Kubelet->>Scheduler: Node capacity updated
    Pod->>Scheduler: Request nvidia.com/gpu: 1
    Scheduler->>Kubelet: Schedule Pod on GPU node
    Kubelet->>DP: Allocate(deviceID)
    DP->>Pod: Mount GPU device + env vars

The Device Plugin communicates with the kubelet via a gRPC interface at /var/lib/kubelet/device-plugins/. The plugin must handle kubelet restarts by monitoring socket deletion and re-registering. The Device Plugin API supports:

  • ListAndWatch — advertises available devices and reports health changes
  • Allocate — provisions device access for containers (device nodes, environment variables, mounts, CDI device names)
  • Health monitoring — marks devices as unhealthy when failures are detected, reducing the node's allocatable count

Since Kubernetes v1.36 (beta), allocatedResourcesStatus in pod status reports per-device health information including error details and failure reasons.

NVIDIA GPU Operator

The NVIDIA GPU Operator automates the full GPU lifecycle on Kubernetes nodes:

Component Purpose
NVIDIA Driver Kernel-level GPU access
NVIDIA Device Plugin GPU advertisement to Kubernetes
NVIDIA Container Toolkit Container runtime GPU integration
DCGM Exporter GPU metrics for Prometheus
Node Feature Discovery Labels nodes with GPU properties
MIG Manager Multi-Instance GPU partition management
GPU Feature Discovery Exposes GPU model, memory, driver version as node labels

This eliminates manual driver installation, device plugin deployment, and monitoring setup across the cluster.


GPU Utilization and Resource Fragmentation

The Core Problem

Standard Kubernetes GPU allocation is binary — a pod requests one or more whole GPUs, and each GPU is allocated exclusively:

Pod A → 1 GPU Requested → 1 GPU Allocated (80 GB)
Actual usage: 10 GB memory, 20% compute
Waste: 70 GB memory, 80% compute

With CPUs, Kubernetes efficiently bin-packs multiple pods onto a single node. GPUs traditionally do not support this — one pod per GPU, regardless of actual utilization.

Cost Impact

Consider a cluster with 8x NVIDIA A100 GPUs where every workload uses only 25% of each GPU:

  • Effective utilization: 2 GPUs worth of useful work
  • Paid for: 8 GPUs
  • Waste: 75% of GPU investment

At $30,000+ per A100 GPU, this translates to ~$180,000 in wasted capacity.

GPU Sharing Strategies

Three primary techniques address GPU underutilization:

Time-Slicing

Multiple workloads take turns using the same GPU, analogous to CPU time-sharing. NVIDIA's Run:ai implementation provides two modes:

Mode Behavior K8s Mapping
Strict Each workload gets exactly its requested GPU compute fraction gpu-compute-request = gpu-compute-limit = gpu-fraction
Fair Each workload gets at least its fraction, plus unused slices from idle workloads gpu-compute-request = gpu-fraction, gpu-compute-limit = 1.0

Time-slicing operates on a plan/lease cycle. Default configuration:

  • Lease time: 250ms (exclusive GPU access per workload)
  • Granularity: 5% precision
  • Plan (cycle) time: 250ms / 0.05 = 5000ms (5 seconds)

A workload requesting gpu-fraction=0.5 gets 2.5s of runtime per 5s cycle.

Trade-offs

Decreasing lease time makes time-slicing less accurate. Increasing lease time improves accuracy but reduces workload responsiveness. Context switching between workloads adds overhead.

Multi-Instance GPU (MIG)

Available on NVIDIA Ampere+ GPUs (A100, H100). MIG partitions a single GPU into up to 7 isolated GPU Instances, each with dedicated:

  • Streaming Multiprocessors (SMs)
  • GPU engines (copy engines, decoders)
  • L2 cache banks
  • Memory controllers
  • DRAM address busses

Example partitioning of an 80 GB A100:

Full A100 (80 GB)
├── MIG Instance 1: 10 GB (1g.10gb)
├── MIG Instance 2: 10 GB (1g.10gb)
├── MIG Instance 3: 20 GB (2g.20gb)
└── MIG Instance 4: 40 GB (4g.40gb)

Each instance provides hardware-level isolation — one workload cannot impact the L2 cache or DRAM bandwidth of another. This makes MIG suitable for multi-tenant environments where QoS guarantees are required.

MIG supports:

  • Bare-metal and containers
  • GPU passthrough virtualization
  • vGPU on supported hypervisors

Continuous Batching (Inference)

Instead of processing inference requests one-by-one, the serving engine combines multiple requests into a single GPU execution. vLLM's continuous batching dynamically adds new requests as older ones complete, keeping the GPU busy continuously rather than waiting for full batch formation.


AI Workload Scheduling

Why Standard Kubernetes Scheduling Falls Short

Traditional applications are loosely coupled — components can start independently and tolerate staggered scheduling. AI training jobs have fundamentally different requirements.

Gang Scheduling

Distributed training jobs require all resources allocated simultaneously:

Training Job (requires 8 GPUs)
├── Worker 0: GPU 0
├── Worker 1: GPU 1
├── Worker 2: GPU 2
├── Worker 3: GPU 3
├── Worker 4: GPU 4
├── Worker 5: GPU 5
├── Worker 6: GPU 6
└── Worker 7: GPU 7

If only 6 GPUs are available, the job cannot start. Partial allocation wastes resources — workers wait indefinitely for the remaining GPUs, blocking other jobs.

Gang scheduling rule: Either schedule all required resources together, or schedule none of them.

Topology Awareness

GPU placement affects training performance significantly. Same-node GPUs communicate via high-speed interconnects (NVLink at 900 GB/s on H100), while cross-node GPUs use network fabric (InfiniBand at ~400 Gb/s).

graph LR
    subgraph "Node A (Fast: NVLink)"
        GPU0["GPU 0"]
        GPU1["GPU 1"]
        GPU2["GPU 2"]
        GPU3["GPU 3"]
        GPU0 <--> GPU1
        GPU1 <--> GPU2
        GPU2 <--> GPU3
    end

    subgraph "Node B (Fast: NVLink)"
        GPU4["GPU 4"]
        GPU5["GPU 5"]
        GPU6["GPU 6"]
        GPU7["GPU 7"]
        GPU4 <--> GPU5
        GPU5 <--> GPU6
        GPU6 <--> GPU7
    end

    GPU3 <-.->|"Slower: Network Fabric"| GPU4

A topology-aware scheduler prefers placing all GPUs on the same node when possible, falling back to nodes with the best inter-node connectivity.

Scheduling Tools Comparison

Tool Type Key Capabilities
Volcano Kubernetes-native batch scheduler Gang scheduling, queue management, fair-share, priority-based preemption
Kueue Kubernetes SIG job queueing Admission control, resource quotas, job queuing, cluster queue management
Kubeflow Training Operator Distributed training CRD PyTorchJob, TFJob, XGBoostJob — works with Volcano for gang scheduling
NVIDIA GPU Operator GPU lifecycle manager Driver management, device plugins, DCGM metrics, MIG management

Ray — Distributed Compute Framework

The Two-Layer Model

graph TB
    subgraph "Application Layer"
        User["User / Application"]
        RayDriver["Ray Driver"]
    end

    subgraph "Ray Layer (Computation Scheduling)"
        RayHead["Ray Head Node<br/>(GCS, Autoscaler, Dashboard)"]
        RayWorker1["Ray Worker 1"]
        RayWorker2["Ray Worker 2"]
        RayWorker3["Ray Worker 3"]
    end

    subgraph "Kubernetes Layer (Infrastructure Scheduling)"
        K8s["kube-scheduler"]
        Pod1["Pod (Head)"]
        Pod2["Pod (Worker)"]
        Pod3["Pod (Worker)"]
        Pod4["Pod (Worker)"]
    end

    User --> RayDriver
    RayDriver --> RayHead
    RayHead --> RayWorker1
    RayHead --> RayWorker2
    RayHead --> RayWorker3

    K8s --> Pod1
    K8s --> Pod2
    K8s --> Pod3
    K8s --> Pod4

    Pod1 -.- RayHead
    Pod2 -.- RayWorker1
    Pod3 -.- RayWorker2
    Pod4 -.- RayWorker3

Kubernetes schedules infrastructure (which node should this pod run on?). Ray schedules computation (which worker executes which task? how are results collected?).

Ray Architecture

Component Role
Head Node Runs GCS (Global Control Service), autoscaler, Ray dashboard. Also schedules tasks like worker nodes.
Worker Nodes Execute Ray tasks and actors. Participate in distributed object storage.
Autoscaler Scales worker nodes based on task/actor resource requests (not CPU/memory metrics).
GCS Central metadata store for cluster state, actor locations, and resource availability.

Tasks vs Actors

Dimension Tasks Actors
State Stateless Stateful
Lifecycle Run once, return result Long-lived, handle multiple requests
Use case Data processing, hyperparameter search Model serving, stateful computation
Invocation function.remote(args) actor.method.remote(args)

Tasks enable embarrassingly parallel workloads (data processing, hyperparameter tuning). Actors enable stateful services (model serving, game environments, RL training).

Ray Libraries

Library Purpose
Ray Train Distributed training (PyTorch, TensorFlow, XGBoost)
Ray Tune Hyperparameter optimization
Ray Serve Model serving and composition
Ray Data Distributed data processing
Ray RLlib Reinforcement learning

Ray on Kubernetes (KubeRay)

KubeRay provides Kubernetes CRDs for managing Ray clusters:

  • RayCluster — manages head and worker pods
  • RayJob — submits jobs to a Ray cluster
  • RayService — manages Ray Serve deployments with zero-downtime upgrades

The autoscaler in KubeRay v2 runs as a sidecar container in the head pod, scaling worker pods based on pending Ray task/actor resource demands.


vLLM — Inference Engine Architecture

Why Naive Model Serving Fails at Scale

A naive inference server processes requests sequentially or in static batches:

Request 1 → GPU → Response 1
Request 2 → GPU → Response 2  (waits for Request 1)
Request 3 → GPU → Response 3  (waits for Request 2)

With 100 concurrent users, GPU utilization stays low because the GPU cannot exploit its parallel architecture. This is the GPU utilization problem from Day 3 applied to inference serving.

PagedAttention

The key innovation in vLLM. Traditional serving systems allocate GPU memory for the KV cache in large, contiguous chunks. For variable-length sequences, this causes:

  • Internal fragmentation — allocated blocks larger than needed
  • External fragmentation — free memory scattered in unusable small chunks
  • Reservation waste — memory reserved for maximum sequence length even for short sequences

PagedAttention treats the KV cache like virtual memory — memory is managed in fixed-size blocks (pages) that need not be contiguous:

Traditional KV Cache:
[████████████░░░░░░░░] Request 1 (wasted space)
[████████░░░░░░░░░░░░] Request 2 (wasted space)
[░░░░░░░░░░░░░░░░░░░░] Free (fragmented)

PagedAttention:
[████][████][████][██] Request 1 (pages, no waste)
[████][████][██]       Request 2 (pages, no waste)
[████][████]           Free (reusable pages)

Results:

  • Near-zero memory waste
  • 2-4x more concurrent requests with the same GPU
  • Dynamic memory allocation as sequences grow

Continuous Batching

Traditional batching waits for a full batch before processing:

Traditional:    Wait → Process Batch → Wait → Process Batch
Continuous:     Process ─── Process ─── Process ─── Process
                (new requests added as old ones complete)

vLLM's continuous batching adds new requests to the running batch as existing requests finish generating tokens. The GPU stays busy continuously, dramatically improving throughput.

vLLM in the Platform Stack

Users
  |
vLLM (inference optimization: PagedAttention + continuous batching)
  |
Model Weights (loaded into GPU memory)
  |
GPU (compute)

In a Kubernetes deployment:

Users
  |
Load Balancer / Gateway
  |
vLLM Pods (KServe InferenceService or Ray Serve)
  |
Kubernetes (scheduling, scaling, health checks)
  |
GPU Nodes (NVIDIA Device Plugin, GPU Operator)

vLLM exposes an OpenAI-compatible API, making it a drop-in replacement for existing applications that use the OpenAI API format.


Full Platform Architecture Diagram

graph TB
    subgraph "Layer 1: AI Applications"
        App1["Virtual Assistants"]
        App2["Recommendation Systems"]
        App3["Fraud Detection"]
        App4["Content Generation"]
    end

    subgraph "Layer 2: AI Platform Services"
        subgraph "Data Pipeline"
            DP["Data Processing"]
            FE["Feature Engineering"]
        end
        subgraph "Model Lifecycle"
            MT["Model Training<br/>(Kubeflow, Ray Train)"]
            MR["Model Registry<br/>(MLflow)"]
        end
        subgraph "Serving & Monitoring"
            MS["Model Serving<br/>(KServe, vLLM, Ray Serve)"]
            MON["Monitoring<br/>(Prometheus, DCGM)"]
        end
    end

    subgraph "Layer 3: Infrastructure"
        subgraph "Orchestration"
            K8S["Kubernetes"]
            SCHED["Schedulers<br/>(Volcano, Kueue)"]
            RAY["Ray Cluster"]
        end
        subgraph "Compute & Storage"
            GPU["GPUs<br/>(A100, H100)"]
            CPU["CPUs"]
            STORE["Object Storage<br/>(S3, MinIO)"]
        end
        subgraph "GPU Management"
            GPUOP["GPU Operator"]
            DEVPLUGIN["Device Plugin"]
            MIG["MIG Manager"]
        end
    end

    App1 & App2 & App3 & App4 --> MS
    DP --> FE --> MT --> MR --> MS
    MS --> MON
    MS --> RAY
    MT --> RAY
    RAY --> K8S
    SCHED --> K8S
    K8S --> GPU & CPU & STORE
    GPUOP --> DEVPLUGIN --> GPU
    GPUOP --> MIG --> GPU

Benchmarks and Scale Considerations

vLLM Performance Characteristics

Based on the PagedAttention paper (arXiv:2309.06180):

  • PagedAttention achieves near-zero KV cache waste vs 60-80% waste in naive allocators
  • Continuous batching can improve throughput by 2-4x over static batching
  • Memory savings translate directly to higher concurrent request capacity

GPU Memory Budget (Inference)

For an LLM with P parameters at B bytes per parameter:

Model weights:  P × B bytes
KV cache:       Variable (grows with context length × batch size)
Overhead:       ~10-20% for framework, CUDA context

Example: Llama 2 70B at FP16:

  • Model weights: 70B × 2 bytes = 140 GB
  • Minimum: 2x A100 80GB (tensor parallelism)
  • With KV cache headroom: 4x A100 80GB for production batch sizes

GPU Interconnect Bandwidth

Interconnect Bandwidth Use Case
NVLink (H100) 900 GB/s Intra-node GPU communication
NVLink (A100) 600 GB/s Intra-node GPU communication
InfiniBand HDR 200 Gb/s (25 GB/s) Inter-node communication
InfiniBand NDR 400 Gb/s (50 GB/s) Inter-node communication
Ethernet (RoCE) 100-400 Gb/s Cost-effective inter-node

Topology-aware scheduling becomes critical when inter-node bandwidth is 10-30x lower than intra-node NVLink bandwidth.