LLM Operations¶
Deployment, serving engines, inference optimization, fine-tuning, distributed scaling, and production best practices for LLMs.
Serving Engines¶
vLLM¶
Open-source inference engine optimized for high throughput. Its core innovation is PagedAttention, which manages the KV cache in block-sized pages and all but eliminates the 60–80% of cache memory that contiguous allocation wastes to fragmentation.
- 14–24x higher throughput than Hugging Face Transformers
- 2.2–3.5x higher throughput than early TGI
- OpenAI-compatible API out of the box
- Continuous batching, prefix caching, SLA-aware scheduling
- Stripe: 73% inference cost reduction, 50M daily API calls on 1/3 the GPU fleet
Distributed parallelism in vLLM:
| Strategy | When to Use | Config |
|---|---|---|
| Tensor Parallelism (TP) | Model too large for one GPU, fits one node | tensor_parallel_size=4 |
| Pipeline Parallelism (PP) | Model too large for one node | pipeline_parallel_size=N_nodes |
| TP + PP combined | Multi-node, large models | Set both parameters |
Default runtime: Ray for multi-node, Python multiprocessing for single-node.
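A minimal sketch of the offline vLLM API with tensor parallelism across two GPUs; the model name and GPU count are illustrative:

```python
from vllm import LLM, SamplingParams

# Shard the model's weight matrices across 2 GPUs on one node (tensor parallelism).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same setting maps onto the server (`vllm serve ... --tensor-parallel-size 2`), as shown in the CLI recipes below.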
TensorRT-LLM¶
NVIDIA's specialized inference library. Uses CUDA graph optimizations, fused kernels, and Tensor Core acceleration.
- H100 + FP8: >10,000 output tok/s at 64 concurrent requests, ~100ms TTFT
- Requires upfront "engine build" step per model/GPU/precision configuration
- Highest raw performance on NVIDIA hardware but complex to set up
SGLang¶
High-performance serving engine built around RadixAttention, which aggressively reuses KV cache entries across requests (a usage sketch follows the list below). Best for:
- Agentic workflows with repeated prefixes
- RAG systems with shared context
- Multi-turn conversations
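A hedged sketch of why this matters: the same long system prompt is reused across requests against an SGLang server's OpenAI-compatible endpoint, so RadixAttention can serve the shared prefix from cache. The launch command, port, and prompts are illustrative.

```python
# Assumes a server started with something like:
#   python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

shared_system = "You are a support agent for ACME Corp. Follow the policy text below ..."
for question in ["How do I reset my password?", "What is the refund window?"]:
    resp = client.chat.completions.create(
        model="default",
        messages=[
            {"role": "system", "content": shared_system},  # identical prefix, cached across calls
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```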
Ollama¶
Single-command local inference. Wraps llama.cpp with a clean CLI and REST API.
Best for: prototyping, local development, personal use. Not production-grade at scale.
LM Studio¶
GUI-based local inference for GGUF models on macOS/Windows/Linux. Download models from Hugging Face, run with one click. Similar audience as Ollama but with a visual interface.
Engine Selection Guide¶
| Scenario | Recommended Engine |
|---|---|
| Fast time-to-serve, OpenAI-compatible | vLLM |
| Absolute lowest latency on NVIDIA | TensorRT-LLM |
| Agentic / RAG with prefix sharing | SGLang |
| Long conversations | TGI v3 (prefix caching) |
| Local prototyping | Ollama or LM Studio |
| Apple Silicon | MLX LM or Ollama |
Inference Optimization¶
KV Cache¶
During autoregressive generation, each new token's attention computation requires the Keys and Values of all previous tokens. The KV cache stores these to avoid redundant recomputation.
Problem: KV cache grows linearly with sequence length and batch size, becoming the primary memory bottleneck for long-context inference.
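A rough sizing sketch makes the bottleneck concrete; the layer and head counts below assume a Llama-3-8B-style model with grouped-query attention, so adjust them for your architecture:

```python
def kv_cache_bytes(batch, seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x for Keys and Values; FP16 cache assumed (2 bytes per element)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# 32 concurrent requests at 8k context: roughly 32 GiB of KV cache,
# more than the FP16 weights of an 8B model.
print(kv_cache_bytes(batch=32, seq_len=8192) / 2**30, "GiB")
```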
Optimization techniques:
| Technique | Description | Impact |
|---|---|---|
| PagedAttention (vLLM) | Manages KV cache in non-contiguous pages, like OS virtual memory | Eliminates 60–80% memory waste |
| KV Cache Quantization | Compress KV cache to INT4/INT2/FP4 | NVIDIA NVFP4: <1% accuracy loss vs BF16 |
| Token Pruning | Evict low-attention tokens from cache | Reduces memory for ultra-long contexts |
| Head Fusion | Merge similar attention heads' KV entries | Reduces cache size for GQA models |
| Entropy-Guided Caching | Allocate more cache to high-entropy (broadly attending) heads | Better quality per memory byte |
| Static KV Cache | Pre-allocate fixed-size cache | Enables torch.compile for up to 4x speedup |
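For the last row, a minimal sketch of pairing a static KV cache with torch.compile in Hugging Face Transformers; it assumes a recent Transformers release and an illustrative model name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")

# A pre-allocated, fixed-size cache gives static tensor shapes, so the decode step can compile.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("Explain KV caching in one sentence.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```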
Speculative Decoding¶
Uses a small, fast draft model to generate candidate tokens, then verifies them in a single forward pass of the large model. Accepted tokens cost no additional large-model passes, so several output tokens can be emitted per verification step (a sketch of the accept/reject logic follows the list below).
graph LR
A[Draft Model - 1B] -->|Generate 5 candidate tokens| B[Large Model - 70B]
B -->|Verify in single pass| C{Accept / Reject}
C -->|Accepted tokens| D[Output]
C -->|Rejected| E[Revert to large model generation]
- Typically 1.5–3x speedup with no quality loss
- DEFT (ICLR 2025): tree-structured speculative decoding achieves 2.2–3.6x speedup
- Prompt Lookup Decoding: uses the prompt itself as the draft source
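The accept/reject logic is easy to state in code. The sketch below shows only the greedy variant; production engines use rejection sampling to preserve the target model's full sampling distribution. `draft_next` and `target_argmax_all` are hypothetical stand-ins for the two models' forward passes.

```python
def speculative_step(tokens, draft_next, target_argmax_all, k=5):
    # 1. The draft model proposes k candidate tokens autoregressively (cheap).
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. One large-model forward pass over tokens + draft yields the target's
    #    argmax prediction at every draft position (len(draft) + 1 predictions).
    target_preds = target_argmax_all(tokens + draft)

    # 3. Accept draft tokens while they match the target's own choice; at the
    #    first mismatch, emit the target's token instead. If everything matches,
    #    the target's final prediction comes along as a bonus token.
    accepted = []
    for i, t in enumerate(draft):
        if t == target_preds[i]:
            accepted.append(t)
        else:
            accepted.append(target_preds[i])
            break
    else:
        accepted.append(target_preds[len(draft)])
    return tokens + accepted
```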
Flash Attention¶
Optimizes attention computation by minimizing GPU memory movement (HBM ↔ SRAM transfers). Standard attention materializes the full $N \times N$ attention matrix; Flash Attention tiles the computation so working data stays in fast SRAM (a usage sketch follows the list below).
- Flash Attention 2: 2x faster than Flash Attention 1
- Flash Attention 3 (July 2024): further optimizations for H100
- FlashInfer (MLSys 2025): customizable attention engine with JIT compilation, integrated into SGLang, vLLM, and MLC-Engine
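A short sketch using PyTorch's `scaled_dot_product_attention`, which dispatches to a FlashAttention-style fused kernel on supported GPUs; shapes and the CUDA device are illustrative:

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 4096, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Fused, tiled attention: the full N x N score matrix is never materialized in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```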
Batching Strategies¶
| Strategy | Description | Best For |
|---|---|---|
| Static Batching | Fixed batch, all requests start/end together | Simple but wasteful |
| Continuous Batching | New requests join the batch as slots free up | Standard for production serving |
| Disaggregated Prefill/Decode | Separate GPU pools for prefill vs decode phases | Advanced; used by NVIDIA Dynamo |
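A toy scheduler makes the difference from static batching concrete. The request objects and `decode_step` are hypothetical stand-ins; real engines also gate admission on KV-cache memory, not just slot count.

```python
from collections import deque

def serve(waiting: deque, decode_step, max_batch=32):
    running = []
    while waiting or running:
        # Continuous batching: admit requests as soon as slots free up,
        # instead of waiting for the whole batch to finish.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        decode_step(running)  # generates one token for every running sequence
        running = [r for r in running if not r.finished]
```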
Parameter-Efficient Fine-Tuning (PEFT)¶
Full fine-tuning updates all parameters — prohibitively expensive for large models. PEFT methods train <1% of parameters while retaining 90–95% of full fine-tuning quality.
LoRA (Low-Rank Adaptation)¶
Injects trainable low-rank matrices into each transformer layer while freezing original weights.
How it works:
For a weight matrix $W \in \mathbb{R}^{d \times k}$, instead of updating $W$ directly, LoRA adds:
$$ W' = W + \Delta W = W + BA $$
Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$ (typically 8–64).
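A minimal LoRA layer sketch in PyTorch (not the PEFT library's implementation): the frozen base projection is wrapped, and only `A` and `B` are trainable.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                           # freeze W
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # r x k
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # d x r, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # W x plus the low-rank update (B A) x; the update starts at zero because B is zero-initialized
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

At deployment time the product `B @ A` can be added into the base weight once, which is why merged LoRA adapters add no inference cost.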
| Property | Value |
|---|---|
| Trainable params | 0.2–0.3% of total |
| Adapter size | Few MB (vs GB for full model) |
| Inference cost | Zero — adapters merge into base weights |
| Task switching | Swap adapter files without reloading base model |
| Quality | Competitive with full fine-tuning for most tasks |
QLoRA (Quantized LoRA)¶
Loads the base model in 4-bit NormalFloat4 quantization while training LoRA adapters in higher precision:
- 75–80% memory reduction vs 16-bit LoRA
- Enables fine-tuning 65B models on a single 48GB GPU
- Quality on par with full 16-bit fine-tuning in many cases
Key innovations:
- NF4 data type: optimized for normally distributed weights
- Double quantization: compresses scale/offset constants themselves
- Paged optimizers: NVIDIA unified memory pages optimizer states between GPU and CPU to survive memory spikes instead of hitting OOM
Adapter Modules¶
Small feed-forward networks inserted after the attention or FFN sublayers. The base model stays frozen; only the adapter weights are trained (a sketch follows the list below).
- More modular than LoRA (can mix/match per task)
- Slight inference overhead (adapters don't merge into base weights)
- Useful for multi-task serving with shared base
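A bottleneck adapter sketch (Houlsby-style): a small down-projection, nonlinearity, and up-projection with a residual connection, inserted after a frozen sublayer. Dimensions are illustrative.

```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):
        # Residual connection: the adapter learns a small correction on top of the frozen sublayer's output
        return hidden + self.up(self.act(self.down(hidden)))
```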
When to Use Which¶
| Method | Best For | Hardware |
|---|---|---|
| Full Fine-Tuning | Maximum quality, small models (<7B) | 8x A100 or equivalent |
| LoRA | General fine-tuning, easy deployment | 1–2x A100 |
| QLoRA | Large models (13B–70B), limited VRAM | Single 24–48GB GPU |
| Adapters | Multi-task serving, modular systems | Similar to LoRA |
Tooling¶
- Hugging Face PEFT: canonical library; model.add_adapter() integrates with Transformers
- bitsandbytes: 4-bit/8-bit quantization for QLoRA
- Unsloth: 2x faster LoRA/QLoRA training with custom CUDA kernels
- Axolotl: config-driven fine-tuning framework wrapping multiple methods
Distributed Inference and Scaling¶
Parallelism Strategies¶
| Strategy | What It Splits | When to Use |
|---|---|---|
| Tensor Parallelism (TP) | Individual layer weights across GPUs | Model too large for one GPU |
| Pipeline Parallelism (PP) | Sequential layers across GPUs/nodes | Multi-node deployment |
| Data Parallelism (DP) | Replicas of the full model | High throughput, model fits one GPU |
| Expert Parallelism (EP) | MoE experts across GPUs | MoE models with many experts |
NVIDIA Dynamo¶
Announced at GTC 2025, Dynamo is a distributed inference orchestration layer on top of vLLM/TensorRT-LLM/SGLang:
- Disaggregated prefill and decode: separate GPU pools optimized for each phase
- Coordinates work across GPU pools
- Smart request routing based on KV cache locality
llm-d (Kubernetes-Native)¶
Launched May 2025 by Red Hat, Google Cloud, IBM, NVIDIA, and CoreWeave:
- Kubernetes-native distributed LLM serving
- Disaggregated prefill/decode stages
- Gateway API Inference Extension for routing
- Dynamic Resource Allocation (DRA) for GPU scheduling
Multi-Model Routing¶
For production deployments with multiple models:
| Tool | Purpose |
|---|---|
| LiteLLM | Unified API gateway for 100+ LLM providers; fallback routing |
| Envoy AI Gateway | Proxy-level routing, rate limiting, auth |
| OpenRouter | Third-party multi-model API with cost optimization |
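A hedged sketch of LiteLLM's unified completion call: the same call shape reaches different providers, so swapping models or wiring up fallbacks is mostly a string change. Model identifiers are illustrative and credentials are read from the usual provider environment variables.

```python
from litellm import completion

for model in ["openai/gpt-4o-mini", "anthropic/claude-3-5-haiku-20241022"]:
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    )
    print(model, "->", resp.choices[0].message.content)
```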
Production Best Practices¶
Deployment Lifecycle¶
graph LR
A[Prototype with Ollama] --> B[Validate with vLLM/SGLang]
B --> C[Optimize: Quantization + Batching]
C --> D[Load Test: Latency + Throughput]
D --> E[Deploy: K8s + Autoscaling]
E --> F[Monitor: Latency SLOs + Quality]
Checklist¶
Pre-Production Checklist
Model Selection
- Benchmark candidate models on your actual task distribution
- Test quantized variants (Q4_K_M, AWQ, FP8) against FP16 baseline
- Validate edge cases: long inputs, multilingual, structured output
Infrastructure
- Right-size GPU selection (H100 for throughput, A10G/L4 for cost, Apple Silicon for privacy)
- Configure tensor parallelism if model exceeds single-GPU VRAM
- Set up continuous batching with appropriate max batch size
- Enable prefix caching for repetitive prompt patterns
Reliability
- Deploy multiple replicas behind a load balancer
- Configure autoscaling based on queue depth, not just CPU/GPU utilization
- Set request timeouts and max token limits
- Implement circuit breakers and fallback to smaller/cached models
- Test failover by terminating instances under load
Monitoring
- Track Time-to-First-Token (TTFT), tokens/second, and end-to-end latency at p50/p95/p99
- Monitor GPU utilization, VRAM usage, and KV cache occupancy
- Log prompt/response lengths for capacity planning
- Set up alerts for latency SLO violations and OOM events
Quality
- Implement output validation (JSON schema, safety filters)
- Run periodic eval benchmarks against held-out test sets
- Monitor for model drift after updates or quantization changes
Cost Optimization¶
| Technique | Savings | Tradeoff |
|---|---|---|
| Quantization (FP16 → INT4) | 70–75% VRAM, 2x speed | Minor quality loss |
| Speculative decoding | 1.5–3x throughput | Draft model complexity |
| Prefix caching | 30–60% for repetitive prompts | Memory for cache storage |
| Request batching | 3–10x throughput | Slightly higher latency |
| Spot/preemptible instances | 60–80% compute cost | Requires graceful interruption handling |
| Model distillation | 5–10x cheaper inference | Upfront distillation cost |
Common Pitfalls¶
Avoid These
- Over-quantizing for your use case: Q2/Q3 works for casual chat but breaks agentic workflows, JSON output, and code generation
- Ignoring TTFT: users perceive time-to-first-token as "speed" more than tokens/second
- Static batching in production: wastes GPU cycles waiting for the longest request in the batch
- No fallback strategy: a single model endpoint is a single point of failure
- Benchmarking with synthetic data: real traffic patterns (variable lengths, bursty arrivals) behave very differently from uniform benchmarks
- Skipping load testing: KV cache OOM under concurrent load is the most common production failure
CLI Recipes¶
Ollama (Local Inference)¶
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull llama3:8b-q4_K_M
ollama run llama3:8b-q4_K_M
# List models
ollama list
# Serve as API
ollama serve # default: http://localhost:11434
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b-q4_K_M","prompt":"Hello"}'
vLLM (Production Serving)¶
# Install
pip install vllm
# Serve a model with tensor parallelism (pass --quantization awq only if the checkpoint is AWQ-quantized)
vllm serve meta-llama/Llama-3-8B-Instruct \
--tensor-parallel-size 2 \
--quantization awq \
--max-model-len 8192 \
--port 8000
# OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3-8B-Instruct","messages":[{"role":"user","content":"Hello"}]}'
MLX LM (Apple Silicon)¶
# Install
pip install mlx-lm
# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Llama-3-8B \
--quantize --q-bits 4 --mlx-path ./llama3-8b-4bit
# Generate
mlx_lm.generate --model mlx-community/Llama-3-8B-4bit \
--prompt "Explain transformers" --max-tokens 500
# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3-8B-4bit \
--train --data ./data --iters 1000  # ./data is a directory containing train.jsonl (and valid.jsonl)
llama.cpp (GGUF Inference)¶
# Build
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release
# Run inference
./build/bin/llama-cli -m ./models/llama3-8b-q4_K_M.gguf \
-p "What is quantization?" -n 256
# Start API server
./build/bin/llama-server -m ./models/llama3-8b-q4_K_M.gguf \
--host 0.0.0.0 --port 8080 -ngl 99 # -ngl: layers offloaded to GPU
Quantization with llama.cpp¶
# Convert HF model to GGUF
python convert_hf_to_gguf.py ./models/llama3-8b/ --outfile llama3-8b-f16.gguf
# Quantize
./build/bin/llama-quantize llama3-8b-f16.gguf llama3-8b-q4_K_M.gguf Q4_K_M
Fine-Tuning with QLoRA (Hugging Face)¶
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
# 4-bit quantization config (NF4 + double quantization, as described above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
lora_config = LoraConfig(
    r=16,                 # rank
    lora_alpha=32,        # scaling
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
# → trainable: 0.2% of total
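# Continuing the sketch: wire the PEFT model into TRL's SFTTrainer. The dataset
# path is illustrative and each JSON record is assumed to have a "text" field;
# argument names vary across TRL releases, so check the docs for your installed version.
from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
)
trainer.train()

model.save_pretrained("./llama3-8b-qlora-adapter")  # saves only the small adapter weights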