AI Platform Engineering: Architecture

GPU Discovery in Kubernetes

Kubernetes does not natively understand GPU hardware. A multi-layer discovery path makes GPUs visible as schedulable resources:

GPU Hardware
      |
NVIDIA Driver (OS-level)
      |
Device Plugin (kubelet integration)
      |
Kubernetes Node (resource advertised)
      |
Pod Requests GPU

Step 1: Physical GPU + Driver

The physical GPU is attached to a worker node. The operating system exposes it through vendor-specific drivers. For NVIDIA GPUs, the NVIDIA driver must be installed on the node before the OS can interact with the hardware.

Step 2: Kubernetes Awareness Gap

Even with the driver installed, Kubernetes remains unaware of the GPU. Out of the box, the scheduler accounts for only a handful of built-in resource types, chiefly:

CPU (cpu)
Memory (memory)
Ephemeral storage (ephemeral-storage)

GPUs must be explicitly registered through the Device Plugin API.

Step 3: Device Plugin Registration

A Device Plugin is a Kubernetes extension that advertises specialized hardware to the kubelet. The NVIDIA Device Plugin runs as a DaemonSet on each GPU node and reports available GPUs.

Once registered, Kubernetes sees the resource:

# Node capacity after device plugin registration
nvidia.com/gpu: 4

Key Insight

Kubernetes does not schedule GPUs because it understands GPUs. It schedules GPUs because a Device Plugin exposes them as generic extended resources. The same mechanism works for TPUs, FPGAs, SmartNICs, and any other accelerator with a Device Plugin implementation.

Step 4: Pod GPU Requests

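After registration, workloads request GPUs through the same resources field used for CPU and memory. Unlike CPU and memory, extended resources such as nvidia.com/gpu must be whole integers, are specified under limits (a request, if given, must equal the limit), and cannot be overcommitted: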

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  containers:
    - name: model-server
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1

The scheduler matches resource requests with available resources — it has no knowledge of whether the workload is running an LLM, image generator, or training job.
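The matching step can be illustrated with a toy model. This is a minimal sketch, not the kube-scheduler's actual filtering code: the `Node` class, the `feasible_nodes` helper, and the node names are all hypothetical, and the only thing modeled is "free capacity covers the request".

```python
from dataclasses import dataclass, field

GPU = "nvidia.com/gpu"  # extended resource name advertised by the device plugin


@dataclass
class Node:
    """Hypothetical view of a node: total allocatable vs. already allocated."""
    name: str
    allocatable: dict
    allocated: dict = field(default_factory=dict)

    def free(self, resource: str) -> int:
        return self.allocatable.get(resource, 0) - self.allocated.get(resource, 0)


def feasible_nodes(nodes, requests):
    """Return names of nodes whose free capacity covers every requested resource."""
    return [
        node.name
        for node in nodes
        if all(node.free(res) >= qty for res, qty in requests.items())
    ]


nodes = [
    Node("cpu-node", {"cpu": 16}),                       # no GPUs advertised
    Node("gpu-node-a", {"cpu": 32, GPU: 4}, {GPU: 4}),   # all 4 GPUs in use
    Node("gpu-node-b", {"cpu": 32, GPU: 4}, {GPU: 1}),   # 3 GPUs free
]
print(feasible_nodes(nodes, {GPU: 1}))  # ['gpu-node-b']
```

Note that the function never inspects what the workload does; it compares integers keyed by resource name, which is why the same path works for any extended resource.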

Device Plugin Architecture

sequenceDiagram
    participant GPU as GPU Hardware
    participant Driver as NVIDIA Driver
    participant DP as NVIDIA Device Plugin
    participant Kubelet as kubelet
    participant Scheduler as kube-scheduler
    participant Pod as Pod

    GPU->>Driver: Hardware attached
    Driver->>DP: GPU devices available
    DP->>Kubelet: Register via gRPC<br/>ListAndWatch(nvidia.com/gpu: 4)
    Kubelet->>Scheduler: Node capacity updated
    Pod->>Scheduler: Request nvidia.com/gpu: 1
    Scheduler->>Kubelet: Schedule Pod on GPU node
    Kubelet->>DP: Allocate(deviceID)
    DP->>Pod: Mount GPU device + env vars

The Device Plugin communicates with the kubelet over gRPC using Unix sockets under /var/lib/kubelet/device-plugins/. Because a restarting kubelet deletes the existing sockets in that directory, the plugin must watch for deletion of its socket and re-register. The Device Plugin API supports:

ListAndWatch — advertises available devices and reports health changes
Allocate — provisions device access for containers (device nodes, environment variables, mounts, CDI device names)
Health monitoring — marks devices as unhealthy when failures are detected, reducing the node's allocatable count
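The three responsibilities listed above can be mimicked in memory to show how health affects the allocatable count. This is a rough sketch under stated assumptions: `ToyDevicePlugin` and its method names are hypothetical, there is no gRPC or streaming involved, and the `Allocate` response is reduced to a single environment variable.

```python
class ToyDevicePlugin:
    """In-memory stand-in for a device plugin (illustrative; not the gRPC API)."""

    def __init__(self, resource_name, device_ids):
        self.resource_name = resource_name
        self.healthy = {device_id: True for device_id in device_ids}
        self.allocated = set()

    def list_and_watch(self):
        """Return the current device list with health, as a plugin would advertise it."""
        return [
            {"id": device_id, "health": "Healthy" if ok else "Unhealthy"}
            for device_id, ok in self.healthy.items()
        ]

    def allocatable_count(self):
        """Only healthy devices count toward the node's allocatable total."""
        return sum(self.healthy.values())

    def mark_unhealthy(self, device_id):
        self.healthy[device_id] = False

    def allocate(self, device_id):
        """Hand one healthy, unused device to a container."""
        if not self.healthy.get(device_id, False):
            raise ValueError(f"{device_id} is unknown or unhealthy")
        if device_id in self.allocated:
            raise ValueError(f"{device_id} is already allocated")
        self.allocated.add(device_id)
        # Simplified response: a real plugin may also return device nodes, mounts, or CDI names.
        return {"envs": {"NVIDIA_VISIBLE_DEVICES": device_id}}


plugin = ToyDevicePlugin("nvidia.com/gpu", ["GPU-0", "GPU-1", "GPU-2", "GPU-3"])
print(plugin.allocatable_count())   # 4
plugin.mark_unhealthy("GPU-2")
print(plugin.allocatable_count())   # 3
print(plugin.allocate("GPU-0"))     # {'envs': {'NVIDIA_VISIBLE_DEVICES': 'GPU-0'}}
```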

Since Kubernetes v1.36 (beta), allocatedResourcesStatus in pod status reports per-device health information including error details and failure reasons.

NVIDIA GPU Operator

The NVIDIA GPU Operator automates the full GPU lifecycle on Kubernetes nodes:

Component	Purpose
NVIDIA Driver	Kernel-level GPU access
NVIDIA Device Plugin	GPU advertisement to Kubernetes
NVIDIA Container Toolkit	Container runtime GPU integration
DCGM Exporter	GPU metrics for Prometheus
Node Feature Discovery	Detects node hardware features (such as PCI devices) and publishes them as node labels
MIG Manager	Multi-Instance GPU partition management
GPU Feature Discovery	Exposes GPU model, memory, driver version as node labels

This eliminates manual driver installation, device plugin deployment, and monitoring setup across the cluster.

GPU Utilization and Resource Fragmentation

The Core Problem

Standard Kubernetes GPU allocation is binary — a pod requests one or more whole GPUs, and each GPU is allocated exclusively:

Pod A → 1 GPU Requested → 1 GPU Allocated (80 GB)
Actual usage: 10 GB memory, 20% compute
Waste: 70 GB memory, 80% compute

With CPUs, Kubernetes efficiently bin-packs multiple pods onto a single node. GPUs traditionally do not support this — one pod per GPU, regardless of actual utilization.

Cost Impact

Consider a cluster with 8x NVIDIA A100 GPUs where every workload uses only 25% of each GPU:

Effective utilization: 2 GPUs worth of useful work
Paid for: 8 GPUs
Waste: 75% of GPU investment

At $30,000+ per A100 GPU, the cluster represents roughly $240,000 of hardware, so the six GPUs' worth of idle capacity translates to ~$180,000 in wasted spend.
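The arithmetic behind these figures is simple enough to check directly. The helper below is a small sketch; the function name is made up for illustration, and the $30,000 price is the document's round figure rather than a quoted list price.

```python
def wasted_capacity(gpu_count, utilization, price_per_gpu):
    """Return (effective GPUs of useful work, dollar value of idle capacity)."""
    effective_gpus = gpu_count * utilization
    idle_gpus = gpu_count - effective_gpus
    return effective_gpus, idle_gpus * price_per_gpu


effective, wasted = wasted_capacity(gpu_count=8, utilization=0.25, price_per_gpu=30_000)
print(effective)  # 2.0
print(wasted)     # 180000.0
```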

Three primary techniques address GPU underutilization:

Time-Slicing

Multiple workloads take turns using the same GPU, analogous to CPU time-sharing. NVIDIA's Run:ai implementation provides two modes:

Mode	Behavior	K8s Mapping
Strict	Each workload gets exactly its requested GPU compute fraction	`gpu-compute-request = gpu-compute-limit = gpu-fraction`
Fair	Each workload gets at least its fraction, plus unused slices from idle workloads	`gpu-compute-request = gpu-fraction`, `gpu-compute-limit = 1.0`

Time-slicing operates on a plan/lease cycle. Default configuration:

Lease time: 250ms (exclusive GPU access per workload)
Granularity: 5% precision
Plan (cycle) time: 250ms / 0.05 = 5000ms (5 seconds)

A workload requesting gpu-fraction=0.5 gets 2.5s of runtime per 5s cycle.
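The relationship between lease time, granularity, and plan time can be written out as a short calculation. This is only a sketch of the formula stated above; the function names are hypothetical and granularity is passed as a whole-number percentage to keep the arithmetic exact.

```python
def plan_time_ms(lease_ms=250, granularity_pct=5):
    """Plan (cycle) time = lease time / granularity, with granularity given in percent."""
    slices_per_plan = 100 // granularity_pct   # 5% precision -> 20 slices
    return lease_ms * slices_per_plan


def runtime_per_plan_ms(gpu_fraction, lease_ms=250, granularity_pct=5):
    """GPU time a workload receives in each plan cycle."""
    return gpu_fraction * plan_time_ms(lease_ms, granularity_pct)


print(plan_time_ms())              # 5000
print(runtime_per_plan_ms(0.5))    # 2500.0
print(plan_time_ms(lease_ms=100))  # 2000
```

The last line shows how a shorter lease shortens the whole cycle, which is the lever behind the trade-offs described next.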

Trade-offs

Decreasing lease time makes time-slicing less accurate. Increasing lease time improves accuracy but reduces workload responsiveness. Context switching between workloads adds overhead.

Multi-Instance GPU (MIG)

MIG is available on NVIDIA data-center GPUs from the Ampere generation onward (for example A100 and H100). It partitions a single GPU into up to 7 isolated GPU Instances, each with dedicated:

Streaming Multiprocessors (SMs)
GPU engines (copy engines, decoders)
L2 cache banks
Memory controllers
DRAM address buses

Example partitioning of an 80 GB A100 (the GPU exposes 7 compute slices, so the instance sizes can total at most 7g):

Full A100 (80 GB)
├── MIG Instance 1: 10 GB (1g.10gb)
├── MIG Instance 2: 20 GB (2g.20gb)
└── MIG Instance 3: 40 GB (4g.40gb)

Each instance provides hardware-level isolation — one workload cannot impact the L2 cache or DRAM bandwidth of another. This makes MIG suitable for multi-tenant environments where QoS guarantees are required.
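A quick way to sanity-check a proposed MIG layout is to total its slices. The sketch below assumes the A100 80 GB budget of 7 compute slices and 8 memory slices; the profile table and helper name are illustrative, and the check is necessary but not sufficient, since real MIG placement rules are stricter than a simple sum.

```python
# Profile -> (compute slices, memory slices) for an A100 80 GB.
A100_80GB_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_slice_budget(profiles):
    """True if the requested instances stay within the GPU's slice totals."""
    compute = sum(A100_80GB_PROFILES[p][0] for p in profiles)
    memory = sum(A100_80GB_PROFILES[p][1] for p in profiles)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


print(fits_slice_budget(["4g.40gb", "2g.20gb", "1g.10gb"]))  # True  (7 compute, 7 memory)
print(fits_slice_budget(["1g.10gb"] * 7))                    # True  (7 compute, 7 memory)
print(fits_slice_budget(["4g.40gb", "4g.40gb"]))             # False (8 compute slices)
```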

MIG supports:

Bare-metal and containers
GPU passthrough virtualization
vGPU on supported hypervisors

Continuous Batching (Inference)

Instead of processing inference requests one-by-one, the serving engine combines multiple requests into a single GPU execution. vLLM's continuous batching dynamically adds new requests as older ones complete, keeping the GPU busy continuously rather than waiting for full batch formation.

AI Workload Scheduling

Why Standard Kubernetes Scheduling Falls Short

Traditional applications are loosely coupled — components can start independently and tolerate staggered scheduling. AI training jobs have fundamentally different requirements.

Gang Scheduling

Distributed training jobs require all resources allocated simultaneously:

Training Job (requires 8 GPUs)
├── Worker 0: GPU 0
├── Worker 1: GPU 1
├── Worker 2: GPU 2
├── Worker 3: GPU 3
├── Worker 4: GPU 4
├── Worker 5: GPU 5
├── Worker 6: GPU 6
└── Worker 7: GPU 7

If only 6 GPUs are available, the job cannot start. Without gang scheduling, the scheduler may still place six of the workers, which then hold their GPUs idle while waiting indefinitely for the remaining two and block other jobs that could have used them.

Gang scheduling rule: Either schedule all required resources together, or schedule none of them.
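The rule can be expressed as an all-or-nothing admission check. The following is a deliberately simplified sketch: `admit_jobs` and the job names are hypothetical, and real schedulers such as Volcano layer queues, priorities, and preemption on top of this basic idea.

```python
def admit_jobs(free_gpus, jobs):
    """Admit each job only if its full GPU demand fits; never grant a partial gang.

    jobs is a list of (name, gpus_needed) pairs in queue order.
    Returns (admitted job names, GPUs still free).
    """
    admitted = []
    for name, gpus_needed in jobs:
        if gpus_needed <= free_gpus:      # the whole gang fits: take all of it
            admitted.append(name)
            free_gpus -= gpus_needed
        # otherwise take nothing, leaving the GPUs available to later jobs
    return admitted, free_gpus


queue = [("train-8gpu", 8), ("finetune-4gpu", 4), ("eval-2gpu", 2)]
print(admit_jobs(6, queue))  # (['finetune-4gpu', 'eval-2gpu'], 0)
```

With 6 free GPUs the 8-GPU job is skipped entirely instead of pinning six idle workers, so the two smaller jobs can run.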

Topology Awareness

GPU placement affects training performance significantly. Same-node GPUs communicate via high-speed interconnects (NVLink at 900 GB/s on H100), while cross-node GPUs use network fabric (InfiniBand at ~400 Gb/s, which is about 50 GB/s).

graph LR
    subgraph "Node A (Fast: NVLink)"
        GPU0["GPU 0"]
        GPU1["GPU 1"]
        GPU2["GPU 2"]
        GPU3["GPU 3"]
        GPU0 <--> GPU1
        GPU1 <--> GPU2
        GPU2 <--> GPU3
    end

    subgraph "Node B (Fast: NVLink)"
        GPU4["GPU 4"]
        GPU5["GPU 5"]
        GPU6["GPU 6"]
        GPU7["GPU 7"]
        GPU4 <--> GPU5
        GPU5 <--> GPU6
        GPU6 <--> GPU7
    end

    GPU3 <-.->|"Slower: Network Fabric"| GPU4

A topology-aware scheduler prefers placing all GPUs on the same node when possible, falling back to nodes with the best inter-node connectivity.
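A placement preference of this kind might look like the sketch below. It is an assumption-laden toy: `place_gpus` is a hypothetical helper, node names are invented, and the fallback simply minimizes the number of nodes used rather than measuring actual inter-node connectivity.

```python
def place_gpus(free_by_node, gpus_needed):
    """Return {node: gpu_count} for a job, or None if the cluster cannot fit it.

    Preference order: a single node (tightest fit first), then as few nodes as possible.
    """
    single = [node for node, free in free_by_node.items() if free >= gpus_needed]
    if single:
        # Tightest fit keeps larger nodes free for bigger jobs.
        best = min(single, key=lambda node: (free_by_node[node], node))
        return {best: gpus_needed}

    if sum(free_by_node.values()) < gpus_needed:
        return None

    placement, remaining = {}, gpus_needed
    # Fill the emptiest nodes first so the job spans the fewest nodes.
    for node, free in sorted(free_by_node.items(), key=lambda kv: (-kv[1], kv[0])):
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
        if remaining == 0:
            break
    return placement


print(place_gpus({"node-a": 4, "node-b": 4}, 4))  # {'node-a': 4}
print(place_gpus({"node-a": 4, "node-b": 4}, 8))  # {'node-a': 4, 'node-b': 4}
print(place_gpus({"node-a": 2, "node-b": 1}, 4))  # None
```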

Scheduling Tools Comparison

Tool	Type	Key Capabilities
Volcano	Kubernetes-native batch scheduler	Gang scheduling, queue management, fair-share, priority-based preemption
Kueue	Kubernetes SIG job queueing	Admission control, resource quotas, job queuing, cluster queue management
Kubeflow Training Operator	Distributed training CRD	PyTorchJob, TFJob, XGBoostJob — works with Volcano for gang scheduling
NVIDIA GPU Operator	GPU lifecycle manager	Driver management, device plugins, DCGM metrics, MIG management

Ray: Distributed Compute Framework

The Two-Layer Model

graph TB
    subgraph "Application Layer"
        User["User / Application"]
        RayDriver["Ray Driver"]
    end

    subgraph "Ray Layer (Computation Scheduling)"
        RayHead["Ray Head Node<br/>(GCS, Autoscaler, Dashboard)"]
        RayWorker1["Ray Worker 1"]
        RayWorker2["Ray Worker 2"]
        RayWorker3["Ray Worker 3"]
    end

    subgraph "Kubernetes Layer (Infrastructure Scheduling)"
        K8s["kube-scheduler"]
        Pod1["Pod (Head)"]
        Pod2["Pod (Worker)"]
        Pod3["Pod (Worker)"]
        Pod4["Pod (Worker)"]
    end

    User --> RayDriver
    RayDriver --> RayHead
    RayHead --> RayWorker1
    RayHead --> RayWorker2
    RayHead --> RayWorker3

    K8s --> Pod1
    K8s --> Pod2
    K8s --> Pod3
    K8s --> Pod4

    Pod1 -.- RayHead
    Pod2 -.- RayWorker1
    Pod3 -.- RayWorker2
    Pod4 -.- RayWorker3

Kubernetes schedules infrastructure (which node should this pod run on?). Ray schedules computation (which worker executes which task? how are results collected?).

Ray Architecture

Component	Role
Head Node	Runs the GCS (Global Control Service), the autoscaler, and the Ray dashboard. Like a worker node, it can also run tasks and actors.
Worker Nodes	Execute Ray tasks and actors. Participate in distributed object storage.
Autoscaler	Scales worker nodes based on task/actor resource requests (not CPU/memory metrics).
GCS	Central metadata store for cluster state, actor locations, and resource availability.

Tasks vs Actors

Dimension	Tasks	Actors
State	Stateless	Stateful
Lifecycle	Run once, return result	Long-lived, handle multiple requests
Use case	Data processing, hyperparameter search	Model serving, stateful computation
Invocation	`function.remote(args)`	`actor.method.remote(args)`

Tasks enable embarrassingly parallel workloads (data processing, hyperparameter tuning). Actors enable stateful services (model serving, game environments, RL training).

Ray Libraries

Library	Purpose
Ray Train	Distributed training (PyTorch, TensorFlow, XGBoost)
Ray Tune	Hyperparameter optimization
Ray Serve	Model serving and composition
Ray Data	Distributed data processing
Ray RLlib	Reinforcement learning

Ray on Kubernetes (KubeRay)

KubeRay provides Kubernetes CRDs for managing Ray clusters:

RayCluster — manages head and worker pods
RayJob — submits jobs to a Ray cluster
RayService — manages Ray Serve deployments with zero-downtime upgrades

With KubeRay, the Ray autoscaler runs as a sidecar container in the head pod and scales worker pods based on pending Ray task/actor resource demands.

vLLM: Inference Engine Architecture

Why Naive Model Serving Fails at Scale

A naive inference server processes requests sequentially or in static batches:

Request 1 → GPU → Response 1
Request 2 → GPU → Response 2  (waits for Request 1)
Request 3 → GPU → Response 3  (waits for Request 2)

With 100 concurrent users, GPU utilization stays low because the GPU cannot exploit its parallel architecture. This is the GPU utilization problem from Day 3 applied to inference serving.

PagedAttention

PagedAttention is the key innovation in vLLM. Traditional serving systems allocate GPU memory for the KV cache in large, contiguous chunks. For variable-length sequences, this causes:

Internal fragmentation — allocated blocks larger than needed
External fragmentation — free memory scattered in unusable small chunks
Reservation waste — memory reserved for maximum sequence length even for short sequences

PagedAttention treats the KV cache like virtual memory — memory is managed in fixed-size blocks (pages) that need not be contiguous:

Traditional KV Cache:
[████████████░░░░░░░░] Request 1 (wasted space)
[████████░░░░░░░░░░░░] Request 2 (wasted space)
[░░░░░░░░░░░░░░░░░░░░] Free (fragmented)

PagedAttention:
[████][████][████][██] Request 1 (pages, no waste)
[████][████][██]       Request 2 (pages, no waste)
[████][████]           Free (reusable pages)

Results:

Near-zero memory waste
2-4x more concurrent requests with the same GPU
Dynamic memory allocation as sequences grow
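The paging idea can be sketched with a tiny block allocator. This is not vLLM's implementation: the `PagedKVCache` class, its method names, and the block size of 16 tokens are assumptions for illustration, and the "cache" tracks only bookkeeping, not tensors.

```python
class PagedKVCache:
    """Toy block allocator: each request owns a list of fixed-size, non-contiguous blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block IDs
        self.block_tables = {}                       # request ID -> physical block IDs
        self.num_tokens = {}                         # request ID -> tokens stored

    def append_token(self, request_id):
        """Store one more token, grabbing a new block only when the last one is full."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("req-1")   # 20 tokens -> 2 blocks
for _ in range(5):
    cache.append_token("req-2")   # 5 tokens  -> 1 block
print(len(cache.block_tables["req-1"]), len(cache.free_blocks))  # 2 5
cache.free("req-1")
print(len(cache.free_blocks))     # 7
```

Because blocks are claimed one at a time as a sequence grows, the unused space per request is bounded by a single partially filled block instead of a full maximum-length reservation.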

Continuous Batching

Traditional batching waits for a full batch before processing:

Traditional:    Wait → Process Batch → Wait → Process Batch
Continuous:     Process ─── Process ─── Process ─── Process
                (new requests added as old ones complete)

vLLM's continuous batching adds new requests to the running batch as existing requests finish generating tokens. The GPU stays busy continuously, dramatically improving throughput.
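The difference can be made concrete with a small step-counting simulation. This is a simplified model, not vLLM's scheduler: each request is reduced to a number of decode steps, every running request produces one token per step, and the function names and request lengths are invented for the example.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes before the next batch starts."""
    return sum(
        max(lengths[i:i + max_batch]) for i in range(0, len(lengths), max_batch)
    )


def continuous_batching_steps(lengths, max_batch):
    """Free slots are refilled from the queue at every step (lengths must be >= 1)."""
    waiting = deque(lengths)
    running = []          # remaining decode steps per in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits a token; finished ones leave.
        running = [left - 1 for left in running if left > 1]
        steps += 1
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]   # output tokens per request (hypothetical)
print(static_batching_steps(lengths, max_batch=4))      # 16
print(continuous_batching_steps(lengths, max_batch=4))  # 10
```

In the static case the three short requests in each batch sit idle while the long one finishes; in the continuous case their slots are reused immediately.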

vLLM in the Platform Stack

Users
  |
vLLM (inference optimization: PagedAttention + continuous batching)
  |
Model Weights (loaded into GPU memory)
  |
GPU (compute)

In a Kubernetes deployment:

Users
  |
Load Balancer / Gateway
  |
vLLM Pods (KServe InferenceService or Ray Serve)
  |
Kubernetes (scheduling, scaling, health checks)
  |
GPU Nodes (NVIDIA Device Plugin, GPU Operator)

vLLM exposes an OpenAI-compatible API, making it a drop-in replacement for existing applications that use the OpenAI API format.

Full Platform Architecture Diagram

graph TB
    subgraph "Layer 1: AI Applications"
        App1["Virtual Assistants"]
        App2["Recommendation Systems"]
        App3["Fraud Detection"]
        App4["Content Generation"]
    end

    subgraph "Layer 2: AI Platform Services"
        subgraph "Data Pipeline"
            DP["Data Processing"]
            FE["Feature Engineering"]
        end
        subgraph "Model Lifecycle"
            MT["Model Training<br/>(Kubeflow, Ray Train)"]
            MR["Model Registry<br/>(MLflow)"]
        end
        subgraph "Serving & Monitoring"
            MS["Model Serving<br/>(KServe, vLLM, Ray Serve)"]
            MON["Monitoring<br/>(Prometheus, DCGM)"]
        end
    end

    subgraph "Layer 3: Infrastructure"
        subgraph "Orchestration"
            K8S["Kubernetes"]
            SCHED["Schedulers<br/>(Volcano, Kueue)"]
            RAY["Ray Cluster"]
        end
        subgraph "Compute & Storage"
            GPU["GPUs<br/>(A100, H100)"]
            CPU["CPUs"]
            STORE["Object Storage<br/>(S3, MinIO)"]
        end
        subgraph "GPU Management"
            GPUOP["GPU Operator"]
            DEVPLUGIN["Device Plugin"]
            MIG["MIG Manager"]
        end
    end

    App1 & App2 & App3 & App4 --> MS
    DP --> FE --> MT --> MR --> MS
    MS --> MON
    MS --> RAY
    MT --> RAY
    RAY --> K8S
    SCHED --> K8S
    K8S --> GPU & CPU & STORE
    GPUOP --> DEVPLUGIN --> GPU
    GPUOP --> MIG --> GPU

Benchmarks and Scale Considerations

vLLM Performance Characteristics

Based on the PagedAttention paper (arXiv:2309.06180):

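Existing serving systems waste roughly 60-80% of KV cache memory to fragmentation and over-reservation; PagedAttention reduces this to near zero
vLLM (PagedAttention combined with continuous batching) improves throughput by 2-4x at comparable latency over prior systems such as FasterTransformer and Orca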
Memory savings translate directly to higher concurrent request capacity

GPU Memory Budget (Inference)

For an LLM with P parameters at B bytes per parameter:

Model weights:  P × B bytes
KV cache:       Variable (grows with context length × batch size)
Overhead:       ~10-20% for framework, CUDA context

Example: Llama 2 70B at FP16:

Model weights: 70B × 2 bytes = 140 GB
Minimum: 2x A100 80GB (tensor parallelism)
With KV cache headroom: 4x A100 80GB for production batch sizes
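The budget above can be turned into a rough calculator. This is a back-of-the-envelope sketch: the function names are hypothetical, 1 GB is taken as 10^9 bytes, and overhead is assumed to be a percentage of the weight size (10% here, the low end of the range above), which is one of several reasonable interpretations.

```python
def weights_gb(params_billion, bytes_per_param):
    """Model weights in GB: (P x 1e9 params) x B bytes / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def kv_cache_headroom_gb(num_gpus, gpu_memory_gb, params_billion,
                         bytes_per_param, overhead_pct=10):
    """Memory left for the KV cache after weights plus overhead (negative = does not fit)."""
    needed = weights_gb(params_billion, bytes_per_param) * (100 + overhead_pct) / 100
    return num_gpus * gpu_memory_gb - needed


print(weights_gb(70, 2))                       # 140
print(kv_cache_headroom_gb(2, 80, 70, 2))      # 6.0
print(kv_cache_headroom_gb(4, 80, 70, 2))      # 166.0
```

Under these assumptions two 80 GB GPUs hold the weights but leave only a few GB for the KV cache, which illustrates why four GPUs are suggested for production batch sizes.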

GPU Interconnect Bandwidth

Interconnect	Bandwidth	Use Case
NVLink (H100)	900 GB/s	Intra-node GPU communication
NVLink (A100)	600 GB/s	Intra-node GPU communication
InfiniBand HDR	200 Gb/s (25 GB/s)	Inter-node communication
InfiniBand NDR	400 Gb/s (50 GB/s)	Inter-node communication
Ethernet (RoCE)	100-400 Gb/s	Cost-effective inter-node

Topology-aware scheduling becomes critical because, by the figures in the table, inter-node InfiniBand bandwidth is roughly 12-36x lower than intra-node NVLink bandwidth.
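The ratio between intra-node and inter-node bandwidth follows directly from the table once the units are aligned. The snippet below is a small sketch using only the table's figures; the dictionary and function names are illustrative, and it compares headline per-link numbers without accounting for multiple NICs per node.

```python
NVLINK_GBYTE_PER_S = {"H100": 900, "A100": 600}        # GB/s, from the table
INFINIBAND_GBIT_PER_S = {"HDR": 200, "NDR": 400}       # Gb/s, from the table


def gbit_to_gbyte(gbit_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


def nvlink_advantage(gpu, fabric):
    """How many times faster NVLink is than one InfiniBand link."""
    return NVLINK_GBYTE_PER_S[gpu] / gbit_to_gbyte(INFINIBAND_GBIT_PER_S[fabric])


for gpu in NVLINK_GBYTE_PER_S:
    for fabric in INFINIBAND_GBIT_PER_S:
        print(gpu, fabric, nvlink_advantage(gpu, fabric))
# H100 HDR 36.0
# H100 NDR 18.0
# A100 HDR 24.0
# A100 NDR 12.0
```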