LLM Operations¶
Deployment, serving engines, inference optimization, fine-tuning, distributed scaling, and production best practices for LLMs.
Serving Engines¶
vLLM¶
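Open-source inference engine optimized for high throughput. Its core innovation is PagedAttention, which stores the KV cache in fixed-size, non-contiguous blocks. Earlier serving systems wasted 60–80% of KV cache memory on fragmentation and over-reservation; PagedAttention cuts that waste to under 4%.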
- 14–24x higher throughput than Hugging Face Transformers
- 2.2–3.5x higher throughput than early TGI
- OpenAI-compatible API out of the box
- Continuous batching, prefix caching, SLA-aware scheduling
- Stripe: 73% inference cost reduction, 50M daily API calls on 1/3 the GPU fleet
Distributed parallelism in vLLM:
| Strategy | When to Use | Config |
|---|---|---|
| Tensor Parallelism (TP) | Model too large for one GPU, fits one node | tensor_parallel_size=4 |
| Pipeline Parallelism (PP) | Model too large for one node | pipeline_parallel_size=N_nodes |
| TP + PP combined | Multi-node, large models | Set both parameters |
Default runtime: Ray for multi-node, Python multiprocessing for single-node.
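The table above can be turned into a rough sizing rule. The sketch below is only illustrative: `choose_parallelism` is a hypothetical helper (not part of vLLM), and the 30% per-GPU reservation for KV cache and runtime overhead is an assumption you should tune for your workload.

```python
import math

def choose_parallelism(model_gb: float, gpu_gb: float, gpus_per_node: int) -> dict:
    """Pick illustrative tensor/pipeline parallel sizes for a model.

    Hypothetical helper: assumes weights dominate memory and reserves
    30% of each GPU for KV cache and runtime overhead.
    """
    usable_gb = gpu_gb * 0.7
    gpus_needed = max(1, math.ceil(model_gb / usable_gb))
    if gpus_needed <= gpus_per_node:
        # Fits on one node: tensor parallelism only, rounded up to a power of two
        tp = 1 << (gpus_needed - 1).bit_length()
        return {"tensor_parallel_size": min(tp, gpus_per_node),
                "pipeline_parallel_size": 1}
    # Spans several nodes: TP inside each node, PP across nodes
    nodes = math.ceil(gpus_needed / gpus_per_node)
    return {"tensor_parallel_size": gpus_per_node,
            "pipeline_parallel_size": nodes}

# 70B model at FP16 (140 GB of weights) on 80 GB GPUs, 8 GPUs per node
print(choose_parallelism(140, 80, 8))
```

The two returned values correspond to the `tensor_parallel_size` and `pipeline_parallel_size` settings in the Config column.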
TensorRT-LLM¶
NVIDIA's specialized inference library. Uses CUDA graph optimizations, fused kernels, and Tensor Core acceleration.
- H100 + FP8: >10,000 output tok/s at 64 concurrent requests, ~100ms TTFT
- Requires upfront "engine build" step per model/GPU/precision configuration
- Highest raw performance on NVIDIA hardware but complex to set up
SGLang¶
High-performance serving with RadixAttention for aggressive KV cache reuse across requests. Best for:
- Agentic workflows with repeated prefixes
- RAG systems with shared context
- Multi-turn conversations
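To see why shared prefixes matter, here is a minimal sketch of prefix reuse. It uses a plain token-level trie as a stand-in; SGLang's real data structure is a radix tree with cache eviction, and the `PrefixCache` class and token IDs below are made up for illustration.

```python
class PrefixCache:
    """Toy token-level trie standing in for a radix tree of cached KV entries."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that KV entries for this token sequence are now cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose KV entries could be reused."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

cache = PrefixCache()
system_prompt = [1, 2, 3, 4]              # shared system-prompt tokens
cache.insert(system_prompt + [10, 11])    # first request populates the cache
reused = cache.cached_prefix_len(system_prompt + [20, 21, 22])
print(reused)  # 4
```

Only the three new tokens of the second request need prefill; the four shared tokens hit the cache.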
Ollama¶
Single-command local inference. Wraps llama.cpp with a clean CLI and REST API.
Best for: prototyping, local development, personal use. Not production-grade at scale.
LM Studio¶
GUI-based local inference for GGUF models on macOS/Windows/Linux. Download models from Hugging Face, run with one click. Similar audience to Ollama but with a visual interface.
Engine Selection Guide¶
| Scenario | Recommended Engine |
|---|---|
| Fast time-to-serve, OpenAI-compatible | vLLM |
| Absolute lowest latency on NVIDIA | TensorRT-LLM |
| Agentic / RAG with prefix sharing | SGLang |
| Long conversations | TGI v3 (prefix caching) |
| Local prototyping | Ollama or LM Studio |
| Apple Silicon | MLX LM or Ollama |
VRAM Estimation¶
Core Formula¶
$$ \text{VRAM}_{\text{total}} = \text{Weights} + \text{KV Cache} + \text{Activations} + \text{Overhead} $$
1. Model Weights¶
$$ \text{Weight Memory} = \text{Parameters} \times \text{Bytes per Parameter} $$
| Precision | Bytes/Param | 7B Model | 13B Model | 70B Model |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 26 GB | 140 GB |
| INT8 | 1 | 7 GB | 13 GB | 70 GB |
| INT4 (Q4) | 0.5 | 3.5 GB | 6.5 GB | 35 GB |
Quick rule of thumb: ~2 GB per 1B parameters at FP16, ~0.5 GB per 1B at INT4.
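The table is a direct application of the weight-memory formula. A small sketch, assuming 1 GB = 10^9 bytes and ignoring the per-block scale metadata that real quantized formats add (`weight_memory_gb` is an illustrative name):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return params_billion * BYTES_PER_PARAM[precision]

for size in (7, 13, 70):
    row = [weight_memory_gb(size, p) for p in ("fp32", "fp16", "int8", "int4")]
    print(f"{size}B:", row)
```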
2. KV Cache¶
The KV cache is the hidden memory monster. It scales linearly with sequence length, batch size, and number of layers:
$$ \text{KV Cache} = 2 \times n_{\text{layers}} \times n_{\text{kv_heads}} \times d_{\text{head}} \times \text{seq_len} \times \text{batch} \times \text{bytes} $$
For LLaMA 3 70B with GQA (80 layers, 8 KV heads, 128 head dim): ~0.31 MB per token at BF16. Standard MHA would require ~2.5 MB per token (8x more).
KV Cache Gotcha
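A model that fits comfortably at 2K context may OOM at 32K. At FP16, each 1,000 tokens adds roughly 0.13 GB for a 7B–8B model with GQA (32 layers, 8 KV heads) and about 0.5 GB for one with standard MHA (32 KV heads). For a 70B model serving long contexts across many concurrent requests, the KV cache can exceed the weight memory.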
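The per-token figures follow from plugging numbers into the KV cache formula. The sketch below reproduces them; the 64-head MHA configuration is an assumed counterfactual (the same model without GQA), and the 32K context with batch size 8 is an arbitrary example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size in bytes. The leading 2 covers one Key and one Value tensor."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# LLaMA 3 70B: 80 layers, 128 head dim, BF16 (2 bytes per element)
per_token_gqa = kv_cache_bytes(80, 8, 128, seq_len=1)    # 8 KV heads (GQA)
per_token_mha = kv_cache_bytes(80, 64, 128, seq_len=1)   # 64 KV heads (assumed MHA)
print(per_token_gqa / 2**20)  # 0.3125 (MiB per token)
print(per_token_mha / 2**20)  # 2.5 (MiB per token)

# Eight concurrent 32K-token sequences with GQA, in GiB
at_32k_batch8 = kv_cache_bytes(80, 8, 128, seq_len=32_768, batch=8) / 2**30
print(at_32k_batch8)  # 80.0
```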
3. Activations and Overhead¶
- Activations: intermediate tensors during the forward pass; typically 5–20% of weight memory for inference
- Framework overhead: CUDA context, memory allocator, driver — 500 MB to 2 GB
Practical formula for inference:
$$ \text{VRAM}_{\text{inference}} \approx \text{Weight Memory} \times 1.2 + \text{KV Cache} $$
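Rearranging the practical formula gives a quick estimate of how many tokens of KV cache fit on a given GPU. This is a sketch under stated assumptions: `max_cached_tokens` is a hypothetical helper, and the example model (8B with GQA, 16 GB of BF16 weights, 131,072 bytes of KV cache per token) is chosen for illustration.

```python
def max_cached_tokens(vram_gb, weight_gb, kv_bytes_per_token, overhead_factor=1.2):
    """Tokens of KV cache that fit after reserving weights x overhead_factor.

    The result is a budget shared by all concurrent sequences.
    """
    free_bytes = (vram_gb - weight_gb * overhead_factor) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))

# Assumed example: 8B GQA model in BF16 on a 24 GB GPU
tokens = max_cached_tokens(vram_gb=24, weight_gb=16, kv_bytes_per_token=131_072)
print(tokens)  # 36621
```

Dividing the budget by the expected context length gives an upper bound on concurrent sequences, for example about four sequences at 8K tokens each.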
4. Training VRAM¶
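Training needs several times the memory of inference because gradients and optimizer states are stored alongside the weights. With mixed-precision Adam the fixed cost is about 16 bytes per parameter, versus 2 bytes per parameter for FP16 inference weights:
| Component | Memory (FP16 mixed-precision training) |
|---|---|
| Model weights (FP16) | 2 bytes/param |
| Gradients (FP16) | 2 bytes/param |
| FP32 master weights | 4 bytes/param |
| Optimizer states (Adam) | 8 bytes/param (2x FP32 moments) |
| Activations | Variable (depends on batch size, checkpointing) |
| Total | ~16 GB per 1B params before activations (rule of thumb) |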
QLoRA reduces this to ~1 GB per 1B params by quantizing weights to 4-bit and training only LoRA adapters.
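The 16 GB per 1B rule of thumb can be reproduced with a few lines. This sketch assumes the common mixed-precision Adam recipe, in which an FP32 master copy of the weights (4 bytes/param) is kept next to the FP16 weights, FP16 gradients, and two FP32 Adam moments; activations are excluded.

```python
def training_memory_gb(params_billion, bytes_weights=2, bytes_grads=2,
                       bytes_master=4, bytes_adam=8):
    """Fixed training memory in GB for mixed-precision Adam, before activations."""
    per_param = bytes_weights + bytes_grads + bytes_master + bytes_adam
    return params_billion * per_param

print(training_memory_gb(1))   # 16 (GB per 1B parameters)
print(training_memory_gb(7))   # 112 (GB for a 7B model)
```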
MoE Memory¶
All experts must reside in VRAM even though only the top-K are active per token. DeepSeek-V3 (671B total) needs roughly 671 GB for weights at FP8, or about 335 GB at 4-bit, even though only 37B params fire per token.
GPU Hardware Selection Guide¶
NVIDIA Data Center GPUs¶
| GPU | VRAM | Memory BW | FP16 TFLOPS | FP8 TFLOPS | Best For |
|---|---|---|---|---|---|
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 989 | 1,979 | Production serving, training |
| H100 NVL | 94 GB HBM3 | 3.9 TB/s | 989 | 1,979 | Large model inference |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | N/A | Previous gen workhorse |
| A100 40GB | 40 GB HBM2e | 1.6 TB/s | 312 | N/A | Budget production |
| L40S | 48 GB GDDR6 | 864 GB/s | 362 | 733 | Inference-optimized, cost-effective |
| A10G | 24 GB GDDR6 | 600 GB/s | 125 | N/A | Cloud inference (AWS g5) |
| L4 | 24 GB GDDR6 | 300 GB/s | 121 | 242 | Edge/cost-sensitive inference |
| B200 | 192 GB HBM3e | 8.0 TB/s | ~2,250 | ~4,500 | Next-gen Blackwell; 2025+ |
Consumer GPUs¶
| GPU | VRAM | Memory BW | Practical Use |
|---|---|---|---|
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | Best consumer GPU for local LLMs; runs 13B Q4 comfortably |
| RTX 4080 | 16 GB | 717 GB/s | 7B Q4 with room for KV cache |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | Popular for fine-tuning; great VRAM/dollar on used market |
| RTX 5090 | 32 GB GDDR7 | 1,790 GB/s | Best consumer option in 2025; runs 30B Q4 |
Apple Silicon¶
| Chip | Unified Memory | Memory BW | Notes |
|---|---|---|---|
| M4 Pro | 24–48 GB | 273 GB/s | Runs 8B BF16 or 30B Q4 |
| M4 Max | 36–128 GB | 546 GB/s | Runs 70B Q4 on 128GB config |
| M3 Ultra | 96–512 GB | 819 GB/s | Can run 405B Q4 or 671B MoE on 512GB |
| M5 | 16–32 GB | 153 GB/s | 19–27% faster than M4 for LLM inference |
Apple Silicon advantage: unified memory means most of the system RAM (macOS reserves a share for the OS) is available for model weights, not just dedicated VRAM. A 256GB M3 Ultra can load models that would otherwise need three 80GB H100s.
Which GPU for Which Task?¶
| Task | Recommended | Why |
|---|---|---|
| Local chat (personal use) | RTX 4090, M4 Pro/Max, RTX 5090 | 24–32GB VRAM handles 7B–13B easily |
| Fine-tuning (QLoRA, 7B–13B) | RTX 3090/4090 (24GB) | Sufficient VRAM for QLoRA |
| Fine-tuning (QLoRA, 70B) | 2x A100 80GB or 1x H100 | 70B QLoRA needs ~80GB |
| Production serving (<13B) | L4 or A10G | Cost-effective for smaller models |
| Production serving (70B+) | 2–4x H100 with TP | Tensor parallelism across GPUs |
| Research / large-scale training | H100/B200 clusters | Maximum throughput |
Inference Optimization¶
KV Cache¶
During autoregressive generation, each new token's attention computation requires the Keys and Values of all previous tokens. The KV cache stores these to avoid redundant recomputation.
Problem: KV cache grows linearly with sequence length and batch size, becoming the primary memory bottleneck for long-context inference.
Optimization techniques:
| Technique | Description | Impact |
|---|---|---|
| PagedAttention (vLLM) | Manages KV cache in non-contiguous pages, like OS virtual memory | Cuts KV cache memory waste from 60–80% to under 4% |
| KV Cache Quantization | Compress KV cache to INT4/INT2/FP4 | NVIDIA NVFP4: <1% accuracy loss vs BF16 |
| Token Pruning | Evict low-attention tokens from cache | Reduces memory for ultra-long contexts |
| Head Fusion | Merge similar attention heads' KV entries | Reduces cache size for GQA models |
| Entropy-Guided Caching | Allocate more cache to high-entropy (broadly attending) heads | Better quality per memory byte |
| Static KV Cache | Pre-allocate fixed-size cache | Enables torch.compile for up to 4x speedup |
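The paging idea behind PagedAttention can be sketched as a block table that maps each sequence to a list of physical pages. This is a toy allocator, not vLLM's implementation; the class name and the page size of 16 tokens are assumptions for the example.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of physical page ids."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:      # last page is full, or no page yet
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            page = self.free_pages.pop(0)
            self.block_tables.setdefault(seq_id, []).append(page)
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=4, page_size=16)
for _ in range(17):
    cache.append_token("a")    # 17 tokens need 2 pages
for _ in range(5):
    cache.append_token("b")    # 5 tokens need 1 page
print(cache.block_tables)      # {'a': [0, 1], 'b': [2]}
```

Pages are allocated on demand, so the only unused slots are those in each sequence's last page, instead of a full pre-reserved maximum-length buffer.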
Speculative Decoding¶
Uses a small, fast draft model to generate candidate tokens, then verifies them in a single forward pass of the large model. Correct tokens are accepted for free.
graph LR
A[Draft Model - 1B] -->|Generate 5 candidate tokens| B[Large Model - 70B]
B -->|Verify in single pass| C{Accept / Reject}
C -->|Accepted tokens| D[Output]
C -->|Rejected| E[Revert to large model generation]
- Typically 1.5–3x speedup with no quality loss
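- DeFT (ICLR 2025): tree-attention algorithm for tree-structured decoding, including tree-based speculative decoding; reports up to 2.2x end-to-end and 3.6x attention-latency speedup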
- Prompt Lookup Decoding: uses the prompt itself as the draft source
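The draft-then-verify loop can be illustrated with a greedy toy version. Production implementations verify all positions in one batched forward pass and use rejection sampling so that sampled outputs match the large model's distribution; the sketch below covers only the greedy case, and the two "models" are made-up integer functions.

```python
def speculative_step(context, draft_next, target_next, k=5):
    """One greedy speculative-decoding step; returns the tokens to append."""
    # 1. Draft model proposes k tokens autoregressively
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. Target model checks each position (a single batched pass on real hardware)
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # everything accepted: one bonus token
    return accepted

def target(ctx):
    """Toy target model: always counts up by 1."""
    return ctx[-1] + 1

def draft(ctx):
    """Toy draft model: agrees with the target except right after a multiple of 4."""
    return ctx[-1] + 1 if ctx[-1] % 4 else ctx[-1] + 2

print(speculative_step([1], draft, target))  # [2, 3, 4, 5]
```

Here three draft tokens are accepted and the fourth is replaced by the target's own token, so one verification pass yields four tokens.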
Flash Attention¶
Optimizes attention computation by minimizing GPU memory movement (HBM ↔ SRAM transfers). Standard attention materializes the full $N \times N$ attention matrix; Flash Attention tiles the computation to keep working data in fast SRAM.
- Flash Attention 2: 2x faster than Flash Attention 1
- Flash Attention 3 (July 2024): further optimizations for H100
- FlashInfer (MLSys 2025): customizable attention engine with JIT compilation, integrated into SGLang, vLLM, and MLC-Engine
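The tiling trick relies on an online softmax: partial results are rescaled whenever a new tile raises the running maximum, so the full score row never has to be held at once. A pure-Python sketch for a single query vector follows; it omits the usual 1/sqrt(d) scaling and all GPU memory details, and the input values are arbitrary.

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Online-softmax attention: processes keys/values one tile at a time."""
    dim = len(values[0])
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)   # rescale earlier partial results
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
out_naive = naive_attention(q, keys, values)
out_tiled = tiled_attention(q, keys, values, tile=2)
print(out_naive, out_tiled)
```

Both functions return the same vector up to floating-point rounding, which is why the tiled form is an exact optimization rather than an approximation.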
Batching Strategies¶
| Strategy | Description | Best For |
|---|---|---|
| Static Batching | Fixed batch, all requests start/end together | Simple but wasteful |
| Continuous Batching | New requests join the batch as slots free up | Standard for production serving |
| Disaggregated Prefill/Decode | Separate GPU pools for prefill vs decode phases | Advanced; used by NVIDIA Dynamo |
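A small simulation shows why static batching wastes capacity. It counts decode steps only, assumes one token per step per slot, and uses made-up output lengths; real schedulers also account for prefill and memory limits.

```python
import heapq

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot the moment it frees up."""
    slots = [0] * batch_size          # time at which each slot becomes free
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + n)
    return max(slots)

lengths = [10, 200, 15, 20, 180, 12, 25, 30]   # output tokens per request (made up)
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)  # 380 200
```

With static batching the short requests sit idle behind the 200-token and 180-token requests; with continuous batching their slots are reused immediately.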
Parameter-Efficient Fine-Tuning (PEFT)¶
Full fine-tuning updates all parameters — prohibitively expensive for large models. PEFT methods train <1% of parameters while retaining 90–95% of full fine-tuning quality.
LoRA (Low-Rank Adaptation)¶
Injects trainable low-rank matrices into each transformer layer while freezing original weights.
How it works:
For a weight matrix $W \in \mathbb{R}^{d \times k}$, instead of updating $W$ directly, LoRA adds:
$$ W' = W + \Delta W = W + BA $$
Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$ (typically 8–64).
| Property | Value |
|---|---|
| Trainable params | 0.2–0.3% of total |
| Adapter size | Few MB (vs GB for full model) |
| Inference cost | Zero — adapters merge into base weights |
| Task switching | Swap adapter files without reloading base model |
| Quality | Competitive with full fine-tuning for most tasks |
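A tiny numeric sketch of the update $W' = W + BA$ shows why merged adapters add no inference cost. The matrices are arbitrary toy values, and the usual $\alpha / r$ scaling factor applied by LoRA libraries is left out to match the formula above.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]        # frozen weight, d=2, k=3
B = [[0.5], [-1.0]]          # d x r with r=1 (trainable)
A = [[1.0, 2.0, 3.0]]        # r x k (trainable)
x = [[1.0], [1.0], [1.0]]    # input column vector

# Unmerged: base path plus low-rank path
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))
# Merged: fold BA into W once, then a single matmul at inference time
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)
print(y_adapter, y_merged)   # [[6.0], [-5.0]] [[6.0], [-5.0]]

def lora_param_fraction(d, k, r):
    """Trainable LoRA parameters relative to the adapted matrix."""
    return r * (d + k) / (d * k)

print(lora_param_fraction(4096, 4096, 16))  # 0.0078125
```

The per-matrix fraction (under 1% here) is higher than the whole-model share in the table because only some weight matrices receive adapters.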
QLoRA (Quantized LoRA)¶
Loads the base model in 4-bit NormalFloat4 quantization while training LoRA adapters in higher precision:
- 75–80% memory reduction vs 16-bit LoRA
- Enables fine-tuning 65B models on a single 48GB GPU
- Quality on par with full 16-bit fine-tuning in many cases
Key innovations:
- NF4 data type: optimized for normally distributed weights
- Double quantization: compresses scale/offset constants themselves
- Paged optimizers: use NVIDIA unified memory to page optimizer state between GPU and CPU during memory spikes, avoiding OOM
Adapter Modules¶
Small feed-forward networks inserted after attention or FFN sublayers. Base model is frozen; only adapter weights train.
- More modular than LoRA (can mix/match per task)
- Slight inference overhead (adapters don't merge into base weights)
- Useful for multi-task serving with shared base
When to Use Which¶
| Method | Best For | Hardware |
|---|---|---|
| Full Fine-Tuning | Maximum quality, small models (<7B) | 8x A100 or equivalent |
| LoRA | General fine-tuning, easy deployment | 1–2x A100 |
| QLoRA | Large models (13B–70B), limited VRAM | Single 24–48GB GPU |
| Adapters | Multi-task serving, modular systems | Similar to LoRA |
Tooling¶
- Hugging Face PEFT: canonical library; `model.add_adapter()` integrates with Transformers
- bitsandbytes: 4-bit/8-bit quantization for QLoRA
- Unsloth: 2x faster LoRA/QLoRA training with custom Triton kernels
- Axolotl: config-driven fine-tuning framework wrapping multiple methods
Distributed Inference and Scaling¶
Parallelism Strategies¶
| Strategy | What It Splits | When to Use |
|---|---|---|
| Tensor Parallelism (TP) | Individual layer weights across GPUs | Model too large for one GPU |
| Pipeline Parallelism (PP) | Sequential layers across GPUs/nodes | Multi-node deployment |
| Data Parallelism (DP) | Replicas of the full model | High throughput, model fits one GPU |
| Expert Parallelism (EP) | MoE experts across GPUs | MoE models with many experts |
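Tensor parallelism is easiest to picture as a column split of one weight matrix. In this toy sketch each shard stands in for one GPU and list concatenation stands in for the all-gather step; the helper names and values are illustrative only.

```python
def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def split_columns(W, parts):
    """Split W column-wise into `parts` equal shards."""
    cols = len(W[0]) // parts
    return [[row[p * cols:(p + 1) * cols] for row in W] for p in range(parts)]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 10]

shards = split_columns(W, parts=2)                  # one shard per "GPU"
partials = [matvec(x, shard) for shard in shards]   # computed independently
y_tp = [v for part in partials for v in part]       # all-gather: concatenate
print(y_tp)  # [51, 62, 73, 84]
```

Each shard holds half the weights yet the combined output equals the unsplit product, which is what lets a layer exceed a single GPU's memory.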
NVIDIA Dynamo¶
Announced at GTC 2025, Dynamo is a distributed inference orchestration layer on top of vLLM/TensorRT-LLM/SGLang:
- Disaggregated prefill and decode: separate GPU pools optimized for each phase
- Coordinates work across GPU pools
- Smart request routing based on KV cache locality
llm-d (Kubernetes-Native)¶
Launched May 2025 by Red Hat, Google Cloud, IBM, NVIDIA, and CoreWeave:
- Kubernetes-native distributed LLM serving
- Disaggregated prefill/decode stages
- Gateway API Inference Extension for routing
- Dynamic Resource Allocation (DRA) for GPU scheduling
Multi-Model Routing¶
For production deployments with multiple models:
| Tool | Purpose |
|---|---|
| LiteLLM | Unified API gateway for 100+ LLM providers; fallback routing |
| Envoy AI Gateway | Proxy-level routing, rate limiting, auth |
| OpenRouter | Third-party multi-model API with cost optimization |
Retrieval-Augmented Generation (RAG)¶
RAG allows LLMs to access external knowledge at inference time, reducing hallucination and enabling domain-specific responses without retraining.
Architecture¶
graph LR
A[User Query] --> B[Embedding Model]
B --> C[Vector Search]
D[Document Corpus] --> E[Chunking]
E --> F[Embedding Model]
F --> G[Vector Database]
C --> G
G --> H[Top-K Relevant Chunks]
H --> I[Augmented Prompt]
A --> I
I --> J[LLM]
J --> K[Grounded Response]
Pipeline Steps¶
| Step | What Happens | Key Decisions |
|---|---|---|
| 1. Document Preparation | Clean, parse, and normalize source documents | Format handling (PDF, HTML, markdown) |
| 2. Chunking | Split documents into retrieval units | Chunk size (256–1024 tokens), overlap, strategy |
| 3. Embedding | Convert chunks to dense vectors | Model choice (voyage-3-large, text-embedding-3-large) |
| 4. Indexing | Store vectors in a vector database | Database choice (Pinecone, Weaviate, Milvus, Qdrant) |
| 5. Retrieval | Find top-K chunks similar to the query | Similarity metric (cosine), hybrid search, K value |
| 6. Reranking | Re-score retrieved chunks for relevance | Cross-encoder reranker (Cohere, BGE, ColBERT) |
| 7. Augmentation | Inject chunks into the LLM prompt | Prompt template design, chunk ordering |
| 8. Generation | LLM produces a grounded answer | Citation generation, faithfulness checking |
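Step 5 (retrieval) reduces to a nearest-neighbour search over embeddings. The sketch below uses brute-force cosine similarity over hand-written three-dimensional vectors; a real pipeline would obtain vectors from an embedding model and query a vector database instead.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """index: list of (chunk_text, embedding). Returns the k most similar chunks."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Made-up chunks and embeddings for illustration
index = [
    ("GPU memory sizing", [0.9, 0.1, 0.0]),
    ("Chunk overlap", [0.0, 1.0, 0.2]),
    ("KV cache growth", [0.8, 0.0, 0.6]),
]
hits = top_k([1.0, 0.0, 0.2], index, k=2)
print(hits)  # ['GPU memory sizing', 'KV cache growth']
```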
Chunking Strategies¶
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split at N tokens with overlap | Simple, predictable |
| Recursive | Split by paragraph, then sentence, then word | General-purpose (LangChain default) |
| Semantic | Split when embedding similarity between adjacent segments drops | Better coherence; +9% recall over fixed |
| Document-structure | Split by headings, sections, markdown structure | Structured documents (docs, wikis) |
| Proposition-based | Extract atomic facts as individual chunks | Highest precision, expensive to compute |
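Fixed-size chunking with overlap, the simplest strategy in the table, can be sketched in a few lines. The function operates on an already tokenized list; the default sizes are placeholders, not recommendations.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # the last chunk already reaches the end
    return chunks

print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```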
Vector Databases¶
| Database | Strength | Scale | Managed Option |
|---|---|---|---|
| Pinecone | Zero-ops managed service | Billions | Yes (primary) |
| Weaviate | Hybrid search + knowledge graph | Millions–Billions | Yes |
| Milvus | Billion-scale, distributed | Billions | Zilliz Cloud |
| Qdrant | Complex metadata filtering | Millions–Billions | Yes |
| Chroma | Developer-friendly, lightweight | Millions | No (self-hosted) |
| pgvector | PostgreSQL extension; no new infra | Millions | Via any Postgres host |
Hybrid Search and Reranking¶
Combining dense (semantic) and sparse (lexical/BM25) retrieval improves results — dense search handles paraphrases and meaning, while sparse search catches exact terms, acronyms, and domain jargon.
After initial retrieval, a cross-encoder reranker re-scores each (query, chunk) pair, adding 10–30% precision improvement at 50–100ms latency cost.
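One common way to combine dense and sparse result lists is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat RRF as one option among several; the document IDs are hypothetical and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists. Returns doc ids ordered by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]     # semantic ranking
sparse = ["d1", "d4", "d3"]    # BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['d1', 'd3', 'd4', 'd2']
```

Documents that rank well in both lists rise to the top; the fused list is then a natural input for the reranker.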
RAG vs Fine-Tuning¶
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge update | Instant (swap documents) | Requires retraining |
| Cost | Low (no training) | High (GPU hours) |
| Hallucination | Reduced (grounded in sources) | Can still hallucinate |
| Latency | Higher (retrieval + generation) | Lower (single forward pass) |
| Best for | Factual Q&A, documents, knowledge bases | Style/format changes, task specialization |
RAG can cut fine-tuning spend by 60–80% by delivering domain knowledge through retrieval rather than parameter updates.
Evaluation Benchmarks¶
Major Benchmarks¶
| Benchmark | What It Tests | Format | Status (2025) |
|---|---|---|---|
| MMLU | General knowledge (57 subjects) | Multiple choice | Saturated; use MMLU-Pro |
| MMLU-Pro | Harder MMLU with more choices | Multiple choice | Current knowledge standard |
| HumanEval | Code generation (Python) | Generate code + unit tests | Top models >90% |
| SWE-bench | Real software engineering (bugs in repos) | Navigate codebase, write patches | Gold standard for coding |
| LiveCodeBench | Rolling code challenges | Competitive programming | Contamination-resistant |
| GSM8K | Grade school math (2–8 steps) | Chain-of-thought reasoning | Saturated; contamination concerns |
| MATH / MATH-500 | Competition math (AMC/AIME) | Multi-step symbolic reasoning | DeepSeek-R1 at 97.3% |
| GPQA | Graduate-level science | Expert-written multiple choice | Hard frontier benchmark |
| HLE (Humanity's Last Exam) | 2,500 expert questions | Multi-modal | Very hard; even frontier LLMs score low |
| Arena Elo (Chatbot Arena) | Human preference ranking | Blind A/B voting | Most trusted overall quality ranking |
| IFEval | Instruction following | Format/constraint verification | Tests structured output compliance |
Benchmark Pitfalls¶
Use Benchmarks Carefully
- Saturation: MMLU, GSM8K, HumanEval are largely solved — score differences at 90%+ are often noise
- Contamination: up to 13% accuracy drops on GSM8K when removing training-overlap examples
- Task mismatch: match benchmarks to your actual use case — MMLU ≠ code, HumanEval ≠ real engineering
- Arena Elo: most trusted overall signal but biased toward chatbot-style tasks
Structured Output and Constrained Decoding¶
Ensuring LLMs produce valid JSON, function calls, or other structured formats.
Approaches¶
| Method | Guarantee | How It Works |
|---|---|---|
| Prompt engineering | Best-effort | Describe desired format in prompt |
| JSON mode (API) | Valid JSON; schema conformance only when the provider supports schema-constrained output | Provider constrains output to valid JSON |
| Constrained decoding | 100% conformant | Mask invalid tokens at each step using grammar/schema |
| Fine-tuning | High but not guaranteed | Train on structured input-output pairs |
How Constrained Decoding Works¶
A logit processor sits between the model's output and sampling. It tracks position within the target grammar (JSON Schema, regex, EBNF) and sets invalid token logits to $-\infty$:
$$ P'(t) = \text{normalize}(P(t) \odot \text{mask}(t)) $$
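The masking step can be sketched directly. This example hard-codes which token indices are legal; a real logit processor would derive that set from the grammar state at every step. The four-token vocabulary and logits are made up.

```python
import math

def constrained_probs(logits, allowed):
    """Softmax over logits after masking tokens outside `allowed` to -inf."""
    if not allowed:
        raise ValueError("grammar allows no token at this step")
    masked = [l if i in allowed else float("-inf") for i, l in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["{", "}", "hello", '"']
logits = [1.0, 0.5, 3.0, 1.0]

# Hypothetical grammar state: at the start of a JSON object only "{" is legal
print(constrained_probs(logits, allowed={0}))      # [1.0, 0.0, 0.0, 0.0]
# Hypothetical later state: either "{" or a quote may follow
print(constrained_probs(logits, allowed={0, 3}))   # [0.5, 0.0, 0.0, 0.5]
```

Even though "hello" has the highest raw logit, it receives zero probability, which is what makes the guarantee hold regardless of what the model prefers.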
Key Libraries¶
| Library | Approach | Overhead | Integration |
|---|---|---|---|
| XGrammar | Pushdown automaton; precomputed bitmasks | Near-zero (~1%) | vLLM, SGLang, MLC |
| llguidance (Microsoft) | Grammar-based; credited by OpenAI | Low | Guidance, OpenAI API |
| Outlines | Regex/CFG masking | Low-moderate | Hugging Face, vLLM |
| Guidance (Microsoft) | Template-based with token healing | Low | Standalone |
vLLM (0.8.5+) supports structural tags — constrain only parts of the output. The model generates free-form text, switches into JSON-constrained tool calls, then back. Critical for agentic workflows.
Safety, Guardrails, and Content Filtering¶
Guardrail Frameworks¶
| Framework | Provider | Approach |
|---|---|---|
| NeMo Guardrails | NVIDIA | Programmable rails via Colang language; input/output/dialog/retrieval rails |
| Llama Guard | Meta | LLM-based classifier; categorizes prompts as safe/unsafe |
| Guardrails AI | Open source | Validator pipeline; JSON schema, toxicity, PII redaction |
| Azure AI Content Safety | Microsoft | Cloud API; real-time content classification |
| Lakera | Third-party | Specialized prompt injection detection |
Types of Rails¶
| Rail Type | When | What |
|---|---|---|
| Input rails | Before LLM processes request | Reject harmful prompts, mask PII, detect injection |
| Output rails | After LLM generates response | Filter toxic content, validate format |
| Dialog rails | During multi-turn conversation | Enforce flow, prevent topic drift |
| Retrieval rails | In RAG pipelines | Filter harmful retrieved chunks |
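Input and output rails compose naturally as wrappers around the model call. The sketch below is deliberately naive: the blocked-phrase list and email regex are illustrative placeholders, `echo_llm` stands in for a real model, and simple keyword checks like this are easy to bypass, so they should never be the only defense.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKED_PHRASES = ("ignore previous instructions",)   # illustrative only

def input_rail(prompt: str) -> str:
    """Reject obviously hostile prompts and mask PII before the model sees it."""
    if any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES):
        raise ValueError("prompt rejected by input rail")
    return EMAIL.sub("[EMAIL]", prompt)

def output_rail(response: str) -> str:
    """Redact PII the model may have emitted."""
    return EMAIL.sub("[EMAIL]", response)

def guarded_call(prompt: str, llm) -> str:
    return output_rail(llm(input_rail(prompt)))

def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"You said: {prompt}"

print(guarded_call("Contact me at jane@example.com", echo_llm))
# You said: Contact me at [EMAIL]
```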
Prompt Injection (2025)¶
The top LLM security concern. 2025 research revealed:
- Attack success rates of 72–92% against several guardrail systems
- Emoji smuggling achieved 100% bypass rate
- Multi-layered defense is the only viable approach
Defense in Depth
No single guardrail is sufficient. Combine: (1) input classification (Llama Guard), (2) output filtering (toxicity, PII), (3) rate limiting + anomaly detection, (4) human review for high-stakes decisions.
Additional PEFT Methods¶
Beyond LoRA, QLoRA, and adapters:
| Method | How It Works | Trainable Params | Best For |
|---|---|---|---|
| DoRA (Weight-Decomposed Low-Rank Adaptation) | Decomposes weights into magnitude + direction; LoRA on direction only | ~same as LoRA | Better quality at same rank; drop-in LoRA replacement |
| Prefix Tuning | Prepends trainable "virtual tokens" to each layer's K and V | ~0.1% | Few-shot task adaptation |
| Prompt Tuning | Adds trainable embeddings to input only (not each layer) | ~0.01% | Extremely lightweight; best for classification |
| IA3 | Learns rescaling vectors for K, V, and FFN activations | ~0.01% | Few-shot with minimal parameters |
Fine-Tuning Data Preparation¶
| Aspect | Recommendation |
|---|---|
| Format | JSONL with instruction/input/output (Alpaca) or messages array (ChatML/ShareGPT) |
| Quality over quantity | 1,000 high-quality examples often outperform 100,000 noisy ones |
| Diversity | Cover the range of expected inputs; avoid over-representing any pattern |
| Decontamination | Remove examples overlapping with evaluation benchmarks |
| Minimum size | LoRA/QLoRA: 500–5K for task-specific; 10K–100K for general instruction tuning |
| Validation split | Hold out 5–10%; monitor loss for overfitting |
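The deduplication and validation-split rows can be sketched with the standard library alone. This example removes exact duplicates only (near-duplicate and benchmark decontamination need more than this), uses Alpaca-style records with invented content, and `prepare_dataset` is an illustrative name.

```python
import json
import random

def prepare_dataset(records, val_fraction=0.1, seed=0):
    """Deduplicate exact repeats, shuffle, and hold out a validation split."""
    seen, unique = set(), []
    for rec in records:
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]

def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)

records = [
    {"instruction": f"Summarize ticket {i}", "input": "", "output": f"Summary {i}"}
    for i in range(20)
]
records.append(dict(records[0]))          # an exact duplicate to be removed

train, val = prepare_dataset(records, val_fraction=0.1)
print(len(train), len(val))               # 18 2
```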
Fine-Tuning Pipeline¶
graph LR
A[Collect/Generate Data] --> B[Format to JSONL]
B --> C[Decontaminate & Deduplicate]
C --> D[Train/Val Split]
D --> E[Choose Method: LoRA / QLoRA / Full]
E --> F[Train with Early Stopping]
F --> G[Evaluate on Held-Out Set + Benchmarks]
G --> H[Merge Adapter into Base]
H --> I[Deploy]
Production Best Practices¶
Deployment Lifecycle¶
graph LR
A[Prototype with Ollama] --> B[Validate with vLLM/SGLang]
B --> C[Optimize: Quantization + Batching]
C --> D[Load Test: Latency + Throughput]
D --> E[Deploy: K8s + Autoscaling]
E --> F[Monitor: Latency SLOs + Quality]
Checklist¶
Pre-Production Checklist
Model Selection
- Benchmark candidate models on your actual task distribution
- Test quantized variants (Q4_K_M, AWQ, FP8) against FP16 baseline
- Validate edge cases: long inputs, multilingual, structured output
Infrastructure
- Right-size GPU selection (H100 for throughput, A10G/L4 for cost, Apple Silicon for privacy)
- Configure tensor parallelism if model exceeds single-GPU VRAM
- Set up continuous batching with appropriate max batch size
- Enable prefix caching for repetitive prompt patterns
Reliability
- Deploy multiple replicas behind a load balancer
- Configure autoscaling based on queue depth, not just CPU/GPU utilization
- Set request timeouts and max token limits
- Implement circuit breakers and fallback to smaller/cached models
- Test failover by terminating instances under load
Monitoring
- Track Time-to-First-Token (TTFT), tokens/second, and end-to-end latency at p50/p95/p99
- Monitor GPU utilization, VRAM usage, and KV cache occupancy
- Log prompt/response lengths for capacity planning
- Set up alerts for latency SLO violations and OOM events
Quality
- Implement output validation (JSON schema, safety filters)
- Run periodic eval benchmarks against held-out test sets
- Monitor for model drift after updates or quantization changes
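The Monitoring items above call for latency at p50/p95/p99. A minimal sketch using the nearest-rank definition follows; the TTFT samples are made up, and production systems usually rely on histogram-based metrics rather than raw sample lists.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

ttft_ms = [120, 95, 110, 480, 105, 130, 99, 101, 2500, 115]   # made-up samples
summary = {f"p{p}": percentile(ttft_ms, p) for p in (50, 95, 99)}
print(summary)  # {'p50': 110, 'p95': 2500, 'p99': 2500}
```

The median looks healthy while the tail is more than 20x slower, which is why alerts should key on p95/p99 rather than averages.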
Cost Optimization¶
| Technique | Savings | Tradeoff |
|---|---|---|
| Quantization (FP16 → INT4) | 70–75% VRAM, 2x speed | Minor quality loss |
| Speculative decoding | 1.5–3x throughput | Draft model complexity |
| Prefix caching | 30–60% for repetitive prompts | Memory for cache storage |
| Request batching | 3–10x throughput | Slightly higher latency |
| Spot/preemptible instances | 60–80% compute cost | Requires graceful interruption handling |
| Model distillation | 5–10x cheaper inference | Upfront distillation cost |
Common Pitfalls¶
Avoid These
- Over-quantizing for your use case: Q2/Q3 works for casual chat but breaks agentic workflows, JSON output, and code generation
- Ignoring TTFT: users perceive time-to-first-token as "speed" more than tokens/second
- Static batching in production: wastes GPU cycles waiting for the longest request in the batch
- No fallback strategy: a single model endpoint is a single point of failure
- Benchmarking with synthetic data: real traffic patterns (variable lengths, bursty arrivals) behave very differently from uniform benchmarks
- Skipping load testing: KV cache OOM under concurrent load is the most common production failure
CLI Recipes¶
Ollama (Local Inference)¶
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M
# List models
ollama list
# Serve as API
ollama serve # default: http://localhost:11434
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b-instruct-q4_K_M","prompt":"Hello"}'
vLLM (Production Serving)¶
# Install
pip install vllm
# Serve a model with tensor parallelism across 2 GPUs
# (add --quantization awq only when the checkpoint is AWQ-quantized)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000
# OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct","messages":[{"role":"user","content":"Hello"}]}'
MLX LM (Apple Silicon)¶
# Install
pip install mlx-lm
# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Meta-Llama-3-8B \
  --quantize --q-bits 4 --mlx-path ./llama3-8b-4bit
# Generate from the converted model
mlx_lm.generate --model ./llama3-8b-4bit \
  --prompt "Explain transformers" --max-tokens 500
# Fine-tune with LoRA (--data is a directory containing train.jsonl and valid.jsonl)
mlx_lm.lora --model ./llama3-8b-4bit \
  --train --data ./data --iters 1000
llama.cpp (GGUF Inference)¶
# Build
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release
# Run inference
./build/bin/llama-cli -m ./models/llama3-8b-q4_K_M.gguf \
-p "What is quantization?" -n 256
# Start API server
./build/bin/llama-server -m ./models/llama3-8b-q4_K_M.gguf \
--host 0.0.0.0 --port 8080 -ngl 99 # -ngl: layers offloaded to GPU
Quantization with llama.cpp¶
# Convert HF model to GGUF
python convert_hf_to_gguf.py ./models/llama3-8b/ --outfile llama3-8b-f16.gguf
# Quantize
./build/bin/llama-quantize llama3-8b-f16.gguf llama3-8b-q4_K_M.gguf Q4_K_M
Fine-Tuning with QLoRA (Hugging Face)¶
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer  # drives the training loop (not shown here)

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
)
# Prepare the quantized model for training before attaching adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,  # scaling
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
# → trainable: ~0.2% of total