AI Platform Engineering

Summary

AI Platform Engineering is the discipline of designing, building, and operating the infrastructure stack that takes machine learning models from experiments to reliable production services. Unlike traditional platform engineering (CPU-centric, database-backed), AI platform engineering centers on GPU compute — the most expensive and constrained resource in the stack. It extends Kubernetes with specialized schedulers, device plugins, distributed compute frameworks, and inference-optimized serving engines to maximize GPU utilization while maintaining performance SLAs.

Why AI Workloads Are Different

Traditional applications follow a well-understood pattern: User → API → Database. The bottleneck is typically CPU, memory, or database performance, and horizontal scaling is straightforward.

AI inference workloads follow a fundamentally different pattern: User → Inference Server → GPU → Model Weights. The bottleneck shifts from data serving to compute serving:

Dimension	Traditional Applications	AI Applications
Primary resource	CPU, memory	GPU memory, GPU compute
Bottleneck	Database I/O, network latency	Model loading time, memory bandwidth
Idle cost	Idle CPU is tolerable	Idle GPU is extremely expensive
Scaling	Horizontal (add replicas)	GPU-aware (multi-GPU coordination)
Scheduling	Place anywhere with capacity	Topology-aware, gang scheduling
Serving	Stateless request/response	Stateful KV cache, batched inference
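
The "GPU memory" and "Stateful KV cache" entries in the table can be made concrete with a little arithmetic. The sketch below estimates KV cache size from the standard formula (keys plus values, per layer, per token). The model shape used in the example (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative assumption, not a figure from this page.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token."""
    # Factor 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**20, "MiB per token")                  # 0.5 MiB per token
print(per_request / 2**30, "GiB per 4096-token request")   # 2.0 GiB per 4096-token request
```

Under these assumptions a single long request holds about 2 GiB of GPU memory for its whole lifetime, which is why serving is stateful and why memory, rather than raw compute, often limits how many requests one GPU can hold.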

Core Problem Statement

The central challenge is GPU utilization. Organizations invest heavily in GPU hardware to accelerate model execution. If those GPUs sit idle, or are underutilized because of resource fragmentation, the spend continues without delivering value.
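
As a rough illustration of why utilization dominates the cost picture, the sketch below computes how much of a GPU bill pays for idle time. The fleet size, hourly rate, and utilization figure are hypothetical placeholders chosen for easy arithmetic, not measured or quoted prices.

```python
def wasted_gpu_cost(num_gpus, hourly_rate_usd, utilization, hours):
    """Cost of the GPU-hours that were paid for but not used."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    total_cost = num_gpus * hourly_rate_usd * hours
    return total_cost * (1.0 - utilization)


# Hypothetical fleet: 8 GPUs at $2.00 per GPU-hour, 25% utilized, 720 hours.
monthly_waste = wasted_gpu_cost(num_gpus=8, hourly_rate_usd=2.0,
                                utilization=0.25, hours=720)
print(monthly_waste)  # 8640.0
```

With these placeholder numbers, three quarters of the bill buys nothing, which is the gap that scheduling and serving optimizations try to close.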

Platform engineers must answer:

How do we keep GPUs busy?
How do we schedule workloads without wasting compute?
How do we serve more inference requests per GPU?
How do we orchestrate distributed training across multiple GPUs and nodes?
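
The second question, scheduling without wasting compute, is largely about fragmentation. The sketch below shows how a cluster can have plenty of free GPUs in total and still be unable to place a pod that needs several GPUs on one node. The per-node free counts are made up for illustration.

```python
def can_place(free_per_node, gpus_per_pod):
    """True if at least one node has enough free GPUs for the pod."""
    return any(free >= gpus_per_pod for free in free_per_node)


def stranded_gpus(free_per_node, gpus_per_pod):
    """Free GPUs that cannot be used by pods of this size on their node."""
    return sum(free % gpus_per_pod for free in free_per_node)


free = [2, 1, 3, 2]  # hypothetical free GPUs on four nodes

print(sum(free))               # 8 free GPUs in total
print(can_place(free, 4))      # False: no single node has 4 free
print(stranded_gpus(free, 4))  # 8: every free GPU is stranded for 4-GPU pods
print(stranded_gpus(free, 2))  # 2: two odd GPUs left over for 2-GPU pods
```

Packing small workloads onto already-busy nodes, rather than spreading them, is one common way to keep whole nodes free for larger jobs.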

Why Kubernetes Alone Is Not Enough

Kubernetes excels at scheduling containers against CPU and memory resources. AI workloads introduce requirements that a stock Kubernetes cluster does not meet without extensions:

GPU-aware scheduling — GPUs must be discovered and registered via device plugins before Kubernetes can manage them
Gang scheduling — distributed training jobs require all GPUs allocated simultaneously or not at all
Topology awareness — GPU placement across nodes affects inter-GPU communication latency
Multi-GPU coordination — training and inference may span multiple GPUs on a single node or across nodes
Resource sharing — time-slicing and MIG partitioning for efficient sub-GPU allocation
Distributed compute — task and actor scheduling above the infrastructure layer (Ray)
Inference optimization — KV cache management, continuous batching, PagedAttention (vLLM)

This is why the AI infrastructure ecosystem has developed tools like Kubeflow, KServe, Ray, vLLM, Volcano, and Kueue.
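
To make the gang scheduling requirement from the list above concrete, here is a minimal all-or-nothing placement sketch. It is a toy model and not how Volcano or Kueue are implemented. The function name, node names, and first-fit policy are assumptions chosen for brevity.

```python
from typing import Dict, List, Optional


def gang_schedule(free_gpus: Dict[str, int], workers: int,
                  gpus_per_worker: int) -> Optional[List[str]]:
    """Place every worker or none; return the node chosen for each worker."""
    remaining = dict(free_gpus)  # work on a copy so a failed attempt reserves nothing
    placement = []
    for _ in range(workers):
        # First-fit: pick the first node that still has room for one worker.
        node = next((name for name, free in remaining.items()
                     if free >= gpus_per_worker), None)
        if node is None:
            return None  # the gang does not fit, so nothing is allocated
        remaining[node] -= gpus_per_worker
        placement.append(node)
    free_gpus.update(remaining)  # commit only once all workers have a slot
    return placement


cluster = {"node-a": 4, "node-b": 2}  # hypothetical free GPUs per node
print(gang_schedule(cluster, workers=4, gpus_per_worker=2))  # None
print(cluster)  # {'node-a': 4, 'node-b': 2}, unchanged after the failed attempt
print(gang_schedule(cluster, workers=3, gpus_per_worker=2))  # ['node-a', 'node-a', 'node-b']
```

The point of the copy-then-commit pattern is that a job which cannot start in full never holds a partial set of GPUs while waiting for the rest.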

The Three-Layer AI Platform Architecture

A production AI platform consists of three distinct layers:

Layer 1: AI Applications (Business Layer)

The products users interact with:

Virtual assistants and chatbots
Recommendation systems
Fraud detection pipelines
Content generation services
IoT analytics

Layer 2: AI Platform Services (MLOps Layer)

The platform that takes raw data and turns it into production models:

Stage	Purpose	Example Tools
Data Processing	Collect, clean, label, store data	Spark, Flink, Label Studio
Feature Engineering	Transform raw data into model features	Feature Stores (Feast, Tecton)
Model Training	Distributed GPU training	Kubeflow Training Operator, Ray Train
Model Registry	Version, store, track models	MLflow, Weights & Biases
Deployment & Inference	Serve predictions reliably	KServe, Ray Serve, vLLM
Monitoring	Track latency, drift, costs	Prometheus, Grafana, Evidently

Layer 3: Infrastructure (Compute Layer)

The foundation everything depends on:

Category	Components
Compute	CPUs, GPUs, TPUs
Storage	Object storage, data lakes, feature stores
Orchestration	Kubernetes, networking, scheduling
Accelerators	NVIDIA GPUs (A100, H100), Google TPUs

Tool Landscape

Layer	Tool	Purpose
Training Pipelines	Kubeflow, Argo Workflows	End-to-end ML workflow orchestration
Model Registry	MLflow	Model versioning, tracking, and metadata
Distributed Compute	Ray	Task and actor scheduling across GPU clusters
Model Serving	KServe, vLLM	Production inference with GPU optimization
Batch Scheduling	Volcano, Kueue	Gang scheduling, queue management, fair-share
GPU Management	NVIDIA GPU Operator	Driver lifecycle, device plugins, monitoring
Infrastructure	Kubernetes	Container orchestration and resource management
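
The "GPU optimization" credited to vLLM in the table rests in part on the PagedAttention idea named earlier: KV cache memory is handed out in fixed-size blocks on demand instead of being reserved up front for the longest possible sequence. The sketch below is a simplified illustration of that block-table bookkeeping, not vLLM's actual code. The class name, method names, and block size of 16 tokens are assumptions.

```python
class PagedKVAllocator:
    """Toy block-table allocator: each request grows one block at a time."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token; False if the pool is exhausted."""
        count = self.num_tokens.get(request_id, 0)
        table = self.block_tables.get(request_id, [])
        if count == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
            self.block_tables[request_id] = table
        self.num_tokens[request_id] = count + 1
        return True

    def release(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("req-1")
print(len(alloc.block_tables["req-1"]))  # 2 blocks hold 20 tokens
alloc.release("req-1")
print(len(alloc.free_blocks))            # 4 blocks are free again
```

Because a request only ever wastes part of its last block, more concurrent requests fit in the same GPU memory than with per-request worst-case reservations.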

Evaluation

Dimension	Assessment
Maturity	Growing — CNCF ecosystem extending rapidly for AI workloads
Complexity	High — multi-layer stack with many moving parts
Cost Sensitivity	Critical — GPU costs dominate infrastructure budgets
Ecosystem	Active — Kubernetes, Ray, vLLM, Kubeflow all under active development
Entry Barrier	Moderate — requires Kubernetes expertise plus GPU/ML domain knowledge

Pros	Cons
Leverages existing Kubernetes skills	Multi-layer complexity increases operational burden
Modular — components can be adopted incrementally	GPU hardware costs remain significant
Active open-source ecosystem (CNCF, Ray, vLLM)	Fast-moving ecosystem means frequent breaking changes
Enables GPU sharing and cost optimization	Requires specialized knowledge (GPU topology, distributed training)
Supports both training and inference workloads	Monitoring and observability tooling still maturing

Related Topics

LLM Fundamentals — model architecture, quantization, serving engines
Kubernetes — container orchestration fundamentals
Docker — container runtime

Sources

7 Days of AI Platform Engineering — Milind Dethe — primary source for this research
Day 1: Why AI Workloads Are Different
Day 2: GPU Discovery in Kubernetes
Day 3: GPU Utilization & Resource Fragmentation
Day 4: AI Workload Scheduling
Day 5: Ray — Distributed Compute
Day 6: vLLM — Efficient LLM Serving
Day 7: The Full AI Platform Stack
Kubernetes Device Plugins — official documentation on extending K8s for hardware accelerators
NVIDIA MIG User Guide — Multi-Instance GPU partitioning
Run:ai GPU Time-Slicing — advanced GPU time-slicing modes
Ray Cluster Key Concepts — head node, worker nodes, autoscaling
Ray on Kubernetes — KubeRay deployment patterns
vLLM — high-throughput LLM serving engine
PagedAttention Paper (arXiv:2309.06180) — memory-efficient attention for LLM serving
KV Caching Explained — Hugging Face — visual guide to KV cache mechanics

Questions

How will disaggregated prefill/decode architectures (NVIDIA Dynamo, llm-d) change the platform layer?
What is the practical cost comparison between MIG partitioning vs time-slicing vs dedicated GPU allocation for inference workloads?
How do Volcano and Kueue compare for production gang scheduling, and when should each be used?
Will Ray become the default distributed compute layer for all AI platforms, or will Kubernetes-native solutions catch up?
How should platform teams approach GPU capacity planning when model sizes and inference patterns change rapidly?
What monitoring signals beyond standard latency/throughput are essential for GPU-intensive workloads (GPU memory fragmentation, SM utilization, NVLink bandwidth)?