
LLM Operations

Deployment, serving engines, inference optimization, fine-tuning, distributed scaling, and production best practices for LLMs.


Serving Engines

vLLM

Open-source inference engine optimized for high throughput. Its core innovation is PagedAttention, which manages the KV cache in fixed-size blocks and recovers the 60–80% of cache memory that contiguous allocation typically wastes to fragmentation.

  • 14–24x higher throughput than Hugging Face Transformers
  • 2.2–3.5x higher throughput than early TGI
  • OpenAI-compatible API out of the box
  • Continuous batching, prefix caching, SLA-aware scheduling
  • Stripe: 73% inference cost reduction, 50M daily API calls on 1/3 the GPU fleet

Distributed parallelism in vLLM:

| Strategy | When to Use | Config |
|---|---|---|
| Tensor Parallelism (TP) | Model too large for one GPU, fits one node | tensor_parallel_size=4 |
| Pipeline Parallelism (PP) | Model too large for one node | pipeline_parallel_size=N_nodes |
| TP + PP combined | Multi-node, large models | Set both parameters |

Default runtime: Ray for multi-node, Python multiprocessing for single-node.
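Picking tensor_parallel_size is mostly a memory calculation. A minimal sketch, assuming FP16 weights and a rough 1.3x overhead factor for KV cache and activations (both numbers are illustrative, not vLLM defaults):

```python
def min_tensor_parallel(params_b: float, bytes_per_param: int,
                        gpu_vram_gb: float, overhead: float = 1.3) -> int:
    """Smallest power-of-two GPU count whose pooled VRAM covers the weights
    plus a rough overhead allowance for KV cache and activations."""
    need_gb = params_b * bytes_per_param * overhead  # 1B params x N bytes ~ N GB
    tp = 1
    while tp * gpu_vram_gb < need_gb:
        tp *= 2  # attention head counts must divide evenly; powers of two are safest
    return tp

# 70B model in FP16 on 80 GB GPUs -> 4-way tensor parallelism
print(min_tensor_parallel(70, 2, 80))  # → 4
```

An 8B model in FP16 fits a single 24 GB GPU under the same estimate, which is why the small models in the CLI recipes below run without any parallelism flags.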

TensorRT-LLM

NVIDIA's specialized inference library. Uses CUDA graph optimizations, fused kernels, and Tensor Core acceleration.

  • H100 + FP8: >10,000 output tok/s at 64 concurrent requests, ~100ms TTFT
  • Requires upfront "engine build" step per model/GPU/precision configuration
  • Highest raw performance on NVIDIA hardware but complex to set up

SGLang

High-performance serving with RadixAttention for aggressive KV cache reuse across requests. Best for:

  • Agentic workflows with repeated prefixes
  • RAG systems with shared context
  • Multi-turn conversations

Ollama

Single-command local inference. Wraps llama.cpp with a clean CLI and REST API.

ollama pull llama3:8b-q4_K_M
ollama run llama3:8b-q4_K_M "What is attention?"

Best for: prototyping, local development, personal use. Not production-grade at scale.

LM Studio

GUI-based local inference for GGUF models on macOS/Windows/Linux. Download models from Hugging Face, run with one click. Similar audience as Ollama but with a visual interface.

Engine Selection Guide

| Scenario | Recommended Engine |
|---|---|
| Fast time-to-serve, OpenAI-compatible | vLLM |
| Absolute lowest latency on NVIDIA | TensorRT-LLM |
| Agentic / RAG with prefix sharing | SGLang |
| Long conversations | TGI v3 (prefix caching) |
| Local prototyping | Ollama or LM Studio |
| Apple Silicon | MLX LM or Ollama |

Inference Optimization

KV Cache

During autoregressive generation, each new token's attention computation requires the Keys and Values of all previous tokens. The KV cache stores these to avoid redundant recomputation.

Problem: KV cache grows linearly with sequence length and batch size, becoming the primary memory bottleneck for long-context inference.
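That linear growth can be quantified directly from the model geometry. A quick sketch using Llama-3-8B's shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: two tensors (K and V) per layer, per token, per sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Llama-3-8B-like config in FP16, 16 concurrent 8K-token requests
print(round(kv_cache_gb(32, 8, 128, 8192, 16), 1))  # → 17.2
```

Roughly 17 GB of cache for a model whose weights occupy about 16 GB, which is why the cache, not the weights, becomes the bottleneck as batch size and context length grow.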

Optimization techniques:

| Technique | Description | Impact |
|---|---|---|
| PagedAttention (vLLM) | Manages KV cache in non-contiguous pages, like OS virtual memory | Eliminates 60–80% memory waste |
| KV Cache Quantization | Compress KV cache to INT4/INT2/FP4 | NVIDIA NVFP4: <1% accuracy loss vs BF16 |
| Token Pruning | Evict low-attention tokens from cache | Reduces memory for ultra-long contexts |
| Head Fusion | Merge similar attention heads' KV entries | Reduces cache size for GQA models |
| Entropy-Guided Caching | Allocate more cache to high-entropy (broadly attending) heads | Better quality per memory byte |
| Static KV Cache | Pre-allocate fixed-size cache | Enables torch.compile for up to 4x speedup |
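The page-table idea behind PagedAttention can be sketched in a few lines. The 16-token block size matches vLLM's default; everything else here is a toy illustration, not vLLM's implementation:

```python
BLOCK = 16  # tokens per KV block (vLLM defaults to 16)

class PagedKV:
    """Toy page table: sequences claim fixed-size physical blocks on demand,
    so per-sequence waste is bounded by one partially filled block."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.table = {}   # seq_id -> list of physical block ids
        self.length = {}  # seq_id -> tokens written

    def append_token(self, seq_id: str) -> None:
        n = self.length.get(seq_id, 0)
        if n % BLOCK == 0:  # current block full (or first token): map a new one
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def free_seq(self, seq_id: str) -> None:
        self.free += self.table.pop(seq_id)  # blocks return to the shared pool
        del self.length[seq_id]

kv = PagedKV(num_blocks=8)
for _ in range(20):
    kv.append_token("req-1")  # 20 tokens occupy 2 blocks; waste is 12 slots, not a whole preallocated context
print(len(kv.table["req-1"]), len(kv.free))  # → 2 6
```

Contrast with contiguous allocation, which would reserve the full max context length per request up front regardless of how many tokens are actually generated.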

Speculative Decoding

Uses a small, fast draft model to generate candidate tokens, then verifies them in a single forward pass of the large model. Correct tokens are accepted for free.

graph LR
    A[Draft Model - 1B] -->|Generate 5 candidate tokens| B[Large Model - 70B]
    B -->|Verify in single pass| C{Accept / Reject}
    C -->|Accepted tokens| D[Output]
    C -->|Rejected| E[Revert to large model generation]

  • Typically 1.5–3x speedup with no quality loss
  • DEFT (ICLR 2025): tree-structured speculative decoding achieves 2.2–3.6x speedup
  • Prompt Lookup Decoding: uses the prompt itself as the draft source
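The accept/reject loop above can be simulated with toy greedy "models" (plain functions from context to next token; real verification happens in one batched forward pass, which the per-position loop here only imitates):

```python
def speculative_step(target, draft, ctx, k=5):
    """Draft k tokens cheaply, accept the longest prefix the target agrees
    with, and replace the first disagreement with the target's own token.
    One expensive target pass thus yields between 1 and k+1 tokens."""
    proposed, c = [], list(ctx)
    for _ in range(k):
        t = draft(c)
        proposed.append(t)
        c.append(t)
    out, c = [], list(ctx)
    for t in proposed:
        if target(c) == t:        # target would have produced the same token
            out.append(t)
            c.append(t)
        else:                     # mismatch: keep the target's token, stop
            out.append(target(c))
            return out
    out.append(target(c))         # bonus token after a fully accepted draft
    return out

# Toy models: the target counts up by one; the draft errs on multiples of 4
target = lambda c: c[-1] + 1
draft = lambda c: c[-1] + 1 if (c[-1] + 1) % 4 else c[-1] + 2
print(speculative_step(target, draft, [0], k=5))  # → [1, 2, 3, 4]
```

Four tokens emerge from a single "expensive" verification step; the speedup in practice depends on how often the draft model matches the target.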

Flash Attention

Optimizes attention computation by minimizing GPU memory movement (HBM ↔ SRAM transfers). Standard attention materializes the full $N \times N$ attention matrix; Flash Attention tiles the computation to keep working data in fast SRAM.

  • Flash Attention 2: 2x faster than Flash Attention 1
  • Flash Attention 3 (July 2024): further optimizations for H100
  • FlashInfer (MLSys 2025): customizable attention engine with JIT compilation, integrated into SGLang, vLLM, and MLC-Engine
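The key trick that makes tiling possible is the online softmax: running statistics are rescaled whenever a new tile raises the maximum, so the full score row is never materialized. A scalar sketch for a single query (real kernels do this over matrix tiles in SRAM):

```python
import math

def streaming_attention(q, ks, vs, tile=2):
    """Attention for one scalar query, computed tile by tile with a running
    max (m), normalizer (l), and output accumulator (o)."""
    m, l, o = float("-inf"), 0.0, 0.0
    for start in range(0, len(ks), tile):
        for k, v in zip(ks[start:start + tile], vs[start:start + tile]):
            s = q * k                    # scalar stand-in for a dot product
            m_new = max(m, s)
            scale = math.exp(m - m_new)  # rescale old stats to the new max
            l = l * scale + math.exp(s - m_new)
            o = o * scale + math.exp(s - m_new) * v
            m = m_new
    return o / l

q, ks, vs = 1.0, [0.1, 0.4, 0.2, 0.9], [1.0, 2.0, 3.0, 4.0]
naive = (sum(math.exp(q * k) * v for k, v in zip(ks, vs))
         / sum(math.exp(q * k) for k in ks))
print(abs(streaming_attention(q, ks, vs) - naive) < 1e-12)  # → True
```

The streaming result matches the naive softmax-weighted sum exactly, which is why Flash Attention is a memory optimization rather than an approximation.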

Batching Strategies

| Strategy | Description | Notes |
|---|---|---|
| Static Batching | Fixed batch; all requests start/end together | Simple but wasteful |
| Continuous Batching | New requests join the batch as slots free up | Standard for production serving |
| Disaggregated Prefill/Decode | Separate GPU pools for prefill vs decode phases | Advanced; used by NVIDIA Dynamo |
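Why continuous batching wins can be shown with a toy simulation: assume a fixed number of slots and one token emitted per occupied slot per decode step (request lengths and slot count below are made up):

```python
def static_steps(lengths, slots):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_steps(lengths, slots):
    """Continuous batching: a queued request takes over a slot the moment one frees up."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        step = min(active)  # run until the shortest active request completes
        steps += step
        active = [r - step for r in active if r - step > 0]
    return steps

reqs = [128, 8, 8, 8, 8, 8, 8, 8]  # one long request among short ones
print(static_steps(reqs, slots=2), continuous_steps(reqs, slots=2))  # → 152 128
```

With static batching, every short request that shares a batch with the 128-token one wastes its slot for most of the batch; continuous batching keeps both slots busy and finishes as soon as the long request does.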

Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning updates all parameters — prohibitively expensive for large models. PEFT methods train <1% of parameters while retaining 90–95% of full fine-tuning quality.

LoRA (Low-Rank Adaptation)

Injects trainable low-rank matrices into each transformer layer while freezing original weights.

How it works:

For a weight matrix $W \in \mathbb{R}^{d \times k}$, instead of updating $W$ directly, LoRA adds:

$$ W' = W + \Delta W = W + BA $$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$ (typically 8–64).
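Plugging in numbers makes the savings concrete. A sketch assuming a 7B-parameter model with 32 layers and LoRA applied to the four 4096x4096 attention projections (illustrative figures, close to Llama-2-7B's geometry):

```python
def lora_fraction(d, k, r, layers, matrices_per_layer, total_params):
    """Trainable fraction when LoRA adds B (d x r) and A (r x k) per target matrix."""
    trainable = layers * matrices_per_layer * (d * r + r * k)
    return trainable / total_params

# r=16 on q/k/v/o projections of a 7B model: ~16.8M trainable parameters
frac = lora_fraction(4096, 4096, 16, 32, 4, 7e9)
print(f"{frac:.2%}")  # → 0.24%
```

About 17M trainable parameters out of 7B, which lands inside the 0.2–0.3% range in the table below and explains why adapter checkpoints are a few megabytes rather than gigabytes.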

| Property | Value |
|---|---|
| Trainable params | 0.2–0.3% of total |
| Adapter size | Few MB (vs GB for full model) |
| Inference cost | Zero (adapters merge into base weights) |
| Task switching | Swap adapter files without reloading base model |
| Quality | Competitive with full fine-tuning for most tasks |

QLoRA (Quantized LoRA)

Loads the base model in 4-bit NormalFloat4 quantization while training LoRA adapters in higher precision:

  • 75–80% memory reduction vs 16-bit LoRA
  • Enables fine-tuning 65B models on a single 48GB GPU
  • Quality on par with full 16-bit fine-tuning in many cases

Key innovations:

  • NF4 data type: optimized for normally distributed weights
  • Double quantization: compresses scale/offset constants themselves
  • Unified memory paging: seamless GPU↔CPU transfers when GPU OOM
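The memory effect of double quantization is easy to verify with back-of-envelope arithmetic, using the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 scales per second-level block):

```python
BLOCK, BLOCK2 = 64, 256  # first- and second-level block sizes from the QLoRA paper

def scale_overhead_bits(double: bool) -> float:
    """Extra bits per parameter spent on quantization constants."""
    if not double:
        return 32 / BLOCK                    # one FP32 absmax scale per 64 weights
    return 8 / BLOCK + 32 / (BLOCK * BLOCK2) # 8-bit scales plus one FP32 scale-of-scales

print(round(scale_overhead_bits(False), 3),
      round(scale_overhead_bits(True), 3))  # → 0.5 0.127
```

Per-parameter overhead drops from 0.5 bits to about 0.127 bits, which is small individually but saves roughly 3 GB on a 65B-parameter model.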

Adapter Modules

Small feed-forward networks inserted after attention or FFN sublayers. Base model is frozen; only adapter weights train.

  • More modular than LoRA (can mix/match per task)
  • Slight inference overhead (adapters don't merge into base weights)
  • Useful for multi-task serving with shared base
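The adapter itself is just a residual bottleneck. A minimal pure-Python sketch (dimensions and weights are toy values; zero-initializing the up-projection, a common choice, makes the adapter start as an identity function):

```python
def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, add back to the
    frozen sublayer output h as a residual."""
    z = [sum(h[i] * W_down[i][j] for i in range(len(h)))
         for j in range(len(W_down[0]))]
    z = [max(0.0, v) for v in z]  # nonlinearity in the bottleneck
    delta = [sum(z[j] * W_up[j][k] for j in range(len(z)))
             for k in range(len(W_up[0]))]
    return [h[k] + delta[k] for k in range(len(h))]  # residual connection

# d_model=4, bottleneck=2; zero-initialized up-projection -> identity at init
h = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]
W_up = [[0.0] * 4, [0.0] * 4]
print(adapter_forward(h, W_down, W_up))  # → [1.0, -2.0, 0.5, 3.0]
```

Because the delta is added at runtime rather than folded into the frozen weights, each forward pass pays the bottleneck's cost, which is the inference overhead noted above.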

When to Use Which

| Method | Best For | Hardware |
|---|---|---|
| Full Fine-Tuning | Maximum quality, small models (<7B) | 8x A100 or equivalent |
| LoRA | General fine-tuning, easy deployment | 1–2x A100 |
| QLoRA | Large models (13B–70B), limited VRAM | Single 24–48GB GPU |
| Adapters | Multi-task serving, modular systems | Similar to LoRA |

Tooling

  • Hugging Face PEFT: canonical library; model.add_adapter() integrates with Transformers
  • bitsandbytes: 4-bit/8-bit quantization for QLoRA
  • Unsloth: 2x faster LoRA/QLoRA training with custom CUDA kernels
  • Axolotl: config-driven fine-tuning framework wrapping multiple methods

Distributed Inference and Scaling

Parallelism Strategies

| Strategy | What It Splits | When to Use |
|---|---|---|
| Tensor Parallelism (TP) | Individual layer weights across GPUs | Model too large for one GPU |
| Pipeline Parallelism (PP) | Sequential layers across GPUs/nodes | Multi-node deployment |
| Data Parallelism (DP) | Replicas of the full model | High throughput, model fits one GPU |
| Expert Parallelism (EP) | MoE experts across GPUs | MoE models with many experts |

NVIDIA Dynamo

Announced at GTC 2025, Dynamo is a distributed inference orchestration layer on top of vLLM/TensorRT-LLM/SGLang:

  • Disaggregated prefill and decode: separate GPU pools optimized for each phase
  • Coordinates work across GPU pools
  • Smart request routing based on KV cache locality

llm-d (Kubernetes-Native)

Launched May 2025 by Red Hat, Google Cloud, IBM, NVIDIA, and CoreWeave:

  • Kubernetes-native distributed LLM serving
  • Disaggregated prefill/decode stages
  • Gateway API Inference Extension for routing
  • Dynamic Resource Allocation (DRA) for GPU scheduling

Multi-Model Routing

For production deployments with multiple models:

| Tool | Purpose |
|---|---|
| LiteLLM | Unified API gateway for 100+ LLM providers; fallback routing |
| Envoy AI Gateway | Proxy-level routing, rate limiting, auth |
| OpenRouter | Third-party multi-model API with cost optimization |

Production Best Practices

Deployment Lifecycle

graph LR
    A[Prototype with Ollama] --> B[Validate with vLLM/SGLang]
    B --> C[Optimize: Quantization + Batching]
    C --> D[Load Test: Latency + Throughput]
    D --> E[Deploy: K8s + Autoscaling]
    E --> F[Monitor: Latency SLOs + Quality]

Checklist

Pre-Production Checklist

Model Selection

  • Benchmark candidate models on your actual task distribution
  • Test quantized variants (Q4_K_M, AWQ, FP8) against FP16 baseline
  • Validate edge cases: long inputs, multilingual, structured output

Infrastructure

  • Right-size GPU selection (H100 for throughput, A10G/L4 for cost, Apple Silicon for privacy)
  • Configure tensor parallelism if model exceeds single-GPU VRAM
  • Set up continuous batching with appropriate max batch size
  • Enable prefix caching for repetitive prompt patterns

Reliability

  • Deploy multiple replicas behind a load balancer
  • Configure autoscaling based on queue depth, not just CPU/GPU utilization
  • Set request timeouts and max token limits
  • Implement circuit breakers and fallback to smaller/cached models
  • Test failover by terminating instances under load

Monitoring

  • Track Time-to-First-Token (TTFT), tokens/second, and end-to-end latency at p50/p95/p99
  • Monitor GPU utilization, VRAM usage, and KV cache occupancy
  • Log prompt/response lengths for capacity planning
  • Set up alerts for latency SLO violations and OOM events

Quality

  • Implement output validation (JSON schema, safety filters)
  • Run periodic eval benchmarks against held-out test sets
  • Monitor for model drift after updates or quantization changes

Cost Optimization

| Technique | Savings | Tradeoff |
|---|---|---|
| Quantization (FP16 → INT4) | 70–75% VRAM, 2x speed | Minor quality loss |
| Speculative decoding | 1.5–3x throughput | Draft model complexity |
| Prefix caching | 30–60% for repetitive prompts | Memory for cache storage |
| Request batching | 3–10x throughput | Slightly higher latency |
| Spot/preemptible instances | 60–80% compute cost | Requires graceful interruption handling |
| Model distillation | 5–10x cheaper inference | Upfront distillation cost |

Common Pitfalls

Avoid These

  • Over-quantizing for your use case: Q2/Q3 works for casual chat but breaks agentic workflows, JSON output, and code generation
  • Ignoring TTFT: users perceive time-to-first-token as "speed" more than tokens/second
  • Static batching in production: wastes GPU cycles waiting for the longest request in the batch
  • No fallback strategy: a single model endpoint is a single point of failure
  • Benchmarking with synthetic data: real traffic patterns (variable lengths, bursty arrivals) behave very differently from uniform benchmarks
  • Skipping load testing: KV cache OOM under concurrent load is the most common production failure

CLI Recipes

Ollama (Local Inference)

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3:8b-q4_K_M
ollama run llama3:8b-q4_K_M

# List models
ollama list

# Serve as API
ollama serve  # default: http://localhost:11434
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b-q4_K_M","prompt":"Hello"}'

vLLM (Production Serving)

# Install
pip install vllm

# Serve a model with tensor parallelism
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000

# OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct","messages":[{"role":"user","content":"Hello"}]}'

MLX LM (Apple Silicon)

# Install
pip install mlx-lm

# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Meta-Llama-3-8B \
  --quantize --q-bits 4 --mlx-path ./llama3-8b-4bit

# Generate
mlx_lm.generate --model mlx-community/Llama-3-8B-4bit \
  --prompt "Explain transformers" --max-tokens 500

# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3-8B-4bit \
  --train --data ./data --iters 1000  # --data is a directory with train.jsonl / valid.jsonl

llama.cpp (GGUF Inference)

# Build
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run inference
./build/bin/llama-cli -m ./models/llama3-8b-q4_K_M.gguf \
  -p "What is quantization?" -n 256

# Start API server
./build/bin/llama-server -m ./models/llama3-8b-q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99  # -ngl: layers offloaded to GPU

Quantization with llama.cpp

# Convert HF model to GGUF
python convert_hf_to_gguf.py ./models/llama3-8b/ --outfile llama3-8b-f16.gguf

# Quantize
./build/bin/llama-quantize llama3-8b-f16.gguf llama3-8b-q4_K_M.gguf Q4_K_M

Fine-Tuning with QLoRA (Hugging Face)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# 4-bit quantization config (NF4 + double quantization, per the QLoRA section above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                 # rank
    lora_alpha=32,        # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params ≈ 0.2% of total; pass the model to trl's SFTTrainer to train