LLM Operations¶

Deployment, serving engines, inference optimization, fine-tuning, distributed scaling, and production best practices for LLMs.

Serving Engines¶

vLLM¶

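Open-source inference engine optimized for high throughput. Its core innovation is PagedAttention, which stores the KV cache in fixed-size, non-contiguous blocks. Earlier serving systems wasted 60–80% of KV cache memory on fragmentation and over-reservation; PagedAttention reduces that waste to under 4%.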

14–24x higher throughput than Hugging Face Transformers
2.2–3.5x higher throughput than early TGI
OpenAI-compatible API out of the box
Continuous batching, prefix caching, SLA-aware scheduling
Stripe: 73% inference cost reduction, 50M daily API calls on 1/3 the GPU fleet

Distributed parallelism in vLLM:

Strategy	When to Use	Config
Tensor Parallelism (TP)	Model too large for one GPU, fits one node	`tensor_parallel_size=4`
Pipeline Parallelism (PP)	Model too large for one node	`pipeline_parallel_size=N_nodes`
TP + PP combined	Multi-node, large models	Set both parameters

Default runtime: Ray for multi-node, Python multiprocessing for single-node.
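The table above can be turned into a rough sizing helper. The sketch below is only an illustration: `plan_parallelism`, the 0.8 headroom fraction, and the power-of-two rounding are assumptions made for this example, not vLLM behavior. The returned keys match the `tensor_parallel_size` and `pipeline_parallel_size` settings named in the table.

```python
import math

def plan_parallelism(model_gb, gpu_gb, gpus_per_node, usable_fraction=0.8):
    """Pick tensor/pipeline parallel sizes from coarse memory numbers.

    usable_fraction reserves headroom for KV cache and runtime overhead
    (0.8 is an assumed value for this sketch, not a vLLM default).
    """
    usable_per_gpu = gpu_gb * usable_fraction
    gpus_needed = max(1, math.ceil(model_gb / usable_per_gpu))

    if gpus_needed <= gpus_per_node:
        # Fits in one node: tensor parallelism only. Round up to a power of
        # two, since the TP size generally has to divide the head count.
        tp = 1 << (gpus_needed - 1).bit_length()
        return {"tensor_parallel_size": min(tp, gpus_per_node),
                "pipeline_parallel_size": 1}

    # Spans several nodes: TP inside each node, PP across nodes.
    nodes = math.ceil(gpus_needed / gpus_per_node)
    return {"tensor_parallel_size": gpus_per_node,
            "pipeline_parallel_size": nodes}

# 70B at FP16 (~140 GB of weights) on 80 GB GPUs, 8 GPUs per node
print(plan_parallelism(140, 80, 8))
```

Treat the result as a starting point only; real deployments also need to budget KV cache for the target context length and batch size.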

TensorRT-LLM¶

NVIDIA's specialized inference library. Uses CUDA graph optimizations, fused kernels, and Tensor Core acceleration.

H100 + FP8: >10,000 output tok/s at 64 concurrent requests, ~100ms TTFT
Requires upfront "engine build" step per model/GPU/precision configuration
Highest raw performance on NVIDIA hardware but complex to set up

SGLang¶

High-performance serving with RadixAttention for aggressive KV cache reuse across requests. Best for:

Agentic workflows with repeated prefixes
RAG systems with shared context
Multi-turn conversations

Ollama¶

Single-command local inference. Wraps llama.cpp with a clean CLI and REST API.

ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M "What is attention?"

Best for: prototyping, local development, personal use. Not production-grade at scale.

LM Studio¶

GUI-based local inference for GGUF models on macOS/Windows/Linux. Download models from Hugging Face, run with one click. Similar audience as Ollama but with a visual interface.

Engine Selection Guide¶

Scenario	Recommended Engine
Fast time-to-serve, OpenAI-compatible	vLLM
Absolute lowest latency on NVIDIA	TensorRT-LLM
Agentic / RAG with prefix sharing	SGLang
Long conversations	TGI v3 (prefix caching)
Local prototyping	Ollama or LM Studio
Apple Silicon	MLX LM or Ollama

VRAM Estimation¶

Core Formula¶

$$ \text{VRAM}_{\text{total}} = \text{Weights} + \text{KV Cache} + \text{Activations} + \text{Overhead} $$

1. Model Weights¶

$$ \text{Weight Memory} = \text{Parameters} \times \text{Bytes per Parameter} $$

Precision	Bytes/Param	7B Model	13B Model	70B Model
FP32	4	28 GB	52 GB	280 GB
FP16 / BF16	2	14 GB	26 GB	140 GB
INT8	1	7 GB	13 GB	70 GB
INT4 (Q4)	0.5	3.5 GB	6.5 GB	35 GB

Quick rule of thumb: ~2 GB per 1B parameters at FP16, ~0.5 GB per 1B at INT4.
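The weight-memory formula is simple enough to check by hand. As a minimal sketch, the helper below (the name `weight_memory_gb` is made up for this example) reproduces the table values, using 1 GB = 1e9 bytes as the table does.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion, precision):
    """Weight memory in GB: parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]

for size in (7, 13, 70):
    row = {p: weight_memory_gb(size, p) for p in ("fp32", "fp16", "int8", "int4")}
    print(f"{size}B: {row}")
```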

2. KV Cache¶

The KV cache is the hidden memory monster. It scales linearly with sequence length, batch size, and number of layers:

$$ \text{KV Cache} = 2 \times n_{\text{layers}} \times n_{\text{kv_heads}} \times d_{\text{head}} \times \text{seq_len} \times \text{batch} \times \text{bytes} $$

For LLaMA 3 70B with GQA (80 layers, 8 KV heads, 128 head dim): ~0.31 MB per token at BF16. Standard MHA would require ~2.5 MB per token (8x more).
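The per-token figures can be reproduced directly from the formula. This is a small sketch with an illustrative function name; the MHA comparison assumes 64 KV heads, i.e. one per query head, which is what the 8x ratio implies.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # The leading 2 accounts for one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

MIB = 1024 ** 2
GIB = 1024 ** 3

gqa_per_token = kv_cache_bytes(80, 8, 128, seq_len=1)    # GQA: 8 KV heads
mha_per_token = kv_cache_bytes(80, 64, 128, seq_len=1)   # MHA: assumed 64 KV heads

print(gqa_per_token / MIB)   # MiB per token with GQA
print(mha_per_token / MIB)   # MiB per token with MHA
print(kv_cache_bytes(80, 8, 128, seq_len=32_768) / GIB)  # one 32K-token sequence
```

Note that the "MB" figures in the text correspond to MiB (2^20 bytes) here.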

KV Cache Gotcha

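A model that fits comfortably at 2K context may OOM at 32K. Each 1,000 tokens adds roughly 0.13 GB for a 7–8B model with GQA (about 0.5 GB with full multi-head attention, as in Llama 2 7B), and for a 70B model serving long contexts or large batches the KV cache can exceed the weight memory.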

3. Activations and Overhead¶

Activations: intermediate tensors during the forward pass; typically 5–20% of weight memory for inference
Framework overhead: CUDA context, memory allocator, driver — 500 MB to 2 GB

Practical formula for inference:

$$ \text{VRAM}_{\text{inference}} \approx \text{Weight Memory} \times 1.2 + \text{KV Cache} $$
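One way to use the practical formula is to invert it and ask how many tokens of KV cache fit on a given GPU. The sketch below does that under the same 1.2x overhead assumption; both function names are hypothetical, and the per-token KV size must come from the KV cache formula for the specific model.

```python
def inference_vram_gb(params_billion, bytes_per_param, kv_cache_gb, overhead_factor=1.2):
    """Weights x overhead factor + KV cache, all in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param * overhead_factor + kv_cache_gb

def max_context_tokens(gpu_gb, params_billion, bytes_per_param,
                       kv_bytes_per_token, batch=1, overhead_factor=1.2):
    """Tokens of KV cache per sequence that fit after weights and overhead."""
    free_gb = gpu_gb - params_billion * bytes_per_param * overhead_factor
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb * 1e9 // (kv_bytes_per_token * batch))

# 70B at INT4 on one 80 GB GPU, with 327,680 bytes of KV cache per token
print(inference_vram_gb(70, 0.5, kv_cache_gb=10))
print(max_context_tokens(80, 70, 0.5, kv_bytes_per_token=327_680))
print(max_context_tokens(80, 70, 0.5, kv_bytes_per_token=327_680, batch=8))
```

The token budget is shared across the batch, so eight concurrent sequences each get one eighth of the single-sequence context.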

4. Training VRAM¶

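Training needs far more memory than inference because gradients and optimizer states are stored per parameter: roughly 16 bytes per parameter for mixed-precision Adam, or about 8x the FP16 weight memory, before activations:

Component	Memory (FP16 mixed-precision training)
Model weights (FP16)	2 bytes/param
Gradients (FP16)	2 bytes/param
FP32 master weights	4 bytes/param
Optimizer states (Adam)	8 bytes/param (two FP32 moments)
Activations	Variable (depends on batch size, checkpointing)
Total	~16 GB per 1B params (rule of thumb)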

QLoRA reduces this to ~1 GB per 1B params by quantizing weights to 4-bit and training only LoRA adapters.
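The 16 GB per 1B rule of thumb can be broken down per parameter. The sketch below assumes standard mixed-precision Adam with an FP32 master copy of the weights (that copy is the assumption that brings the total to 16 bytes), and it ignores activations; the names are illustrative.

```python
# Bytes stored per parameter, excluding activations.
TRAINING_BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,        # assumed: standard mixed-precision setup
    "adam first moment (fp32)": 4,
    "adam second moment (fp32)": 4,
}

def training_memory_gb(params_billion):
    """Static training memory in GB (1 GB = 1e9 bytes), before activations."""
    return params_billion * sum(TRAINING_BYTES_PER_PARAM.values())

print(sum(TRAINING_BYTES_PER_PARAM.values()))  # bytes per parameter
print(training_memory_gb(7))                   # GB for a 7B model
```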

MoE Memory¶

All experts must reside in VRAM even though only the top-K are active per token. DeepSeek-V3 (671B total) needs roughly 671 GB for weights at FP8, even though only 37B parameters are active per token.

GPU Hardware Selection Guide¶

NVIDIA Data Center GPUs¶

GPU	VRAM	Memory BW	FP16 TFLOPS	FP8 TFLOPS	Best For
H100 SXM	80 GB HBM3	3.35 TB/s	989	1,979	Production serving, training
H100 NVL	94 GB HBM3	3.9 TB/s	989	1,979	Large model inference
A100 80GB	80 GB HBM2e	2.0 TB/s	312	N/A	Previous gen workhorse
A100 40GB	40 GB HBM2e	1.6 TB/s	312	N/A	Budget production
L40S	48 GB GDDR6	864 GB/s	362	733	Inference-optimized, cost-effective
A10G	24 GB GDDR6	600 GB/s	125	N/A	Cloud inference (AWS g5)
L4	24 GB GDDR6	300 GB/s	121	242	Edge/cost-sensitive inference
B200	192 GB HBM3e	8.0 TB/s	~2,250	~4,500	Next-gen Blackwell; 2025+

Consumer GPUs¶

GPU	VRAM	Memory BW	Practical Use
RTX 4090	24 GB GDDR6X	1,008 GB/s	Best consumer GPU for local LLMs; runs 13B Q4 comfortably
RTX 4080	16 GB	717 GB/s	7B Q4 with room for KV cache
RTX 3090	24 GB GDDR6X	936 GB/s	Popular for fine-tuning; great VRAM/dollar on used market
RTX 5090	32 GB GDDR7	1,790 GB/s	Best consumer option in 2025; runs 30B Q4

Apple Silicon¶

Chip	Unified Memory	Memory BW	Notes
M4 Pro	24–64 GB	273 GB/s	Runs 8B BF16 or 30B Q4
M4 Max	36–128 GB	546 GB/s	Runs 70B Q4 on 128GB config
M3 Ultra	96–512 GB	819 GB/s	Can run 405B Q4 or 671B MoE on 512GB
M5	16–32 GB	153 GB/s	19–27% faster than M4 for LLM inference

Apple Silicon advantage: unified memory makes most of the system RAM available to the GPU for model weights, not just a separate pool of dedicated VRAM. A 512GB M3 Ultra can hold models that would otherwise be split across several 80GB H100s, although at a fraction of their memory bandwidth.

Which GPU for Which Task?¶

Task	Recommended	Why
Local chat (personal use)	RTX 4090, M4 Pro/Max, RTX 5090	24–32GB VRAM handles 7B–13B easily
Fine-tuning (QLoRA, 7B–13B)	RTX 3090/4090 (24GB)	Sufficient VRAM for QLoRA
Fine-tuning (QLoRA, 70B)	1x A100 80GB or 1x H100	4-bit 70B weights take ~35 GB; 80GB leaves headroom for activations and longer sequences
Production serving (<13B)	L4 or A10G	Cost-effective for smaller models
Production serving (70B+)	2–4x H100 with TP	Tensor parallelism across GPUs
Research / large-scale training	H100/B200 clusters	Maximum throughput

Inference Optimization¶

KV Cache¶

During autoregressive generation, each new token's attention computation requires the Keys and Values of all previous tokens. The KV cache stores these to avoid redundant recomputation.

Problem: KV cache grows linearly with sequence length and batch size, becoming the primary memory bottleneck for long-context inference.

Optimization techniques:

Technique	Description	Impact
PagedAttention (vLLM)	Manages KV cache in non-contiguous pages, like OS virtual memory	Cuts KV cache memory waste from 60–80% to under 4%
KV Cache Quantization	Compress KV cache to INT4/INT2/FP4	NVIDIA NVFP4: <1% accuracy loss vs BF16
Token Pruning	Evict low-attention tokens from cache	Reduces memory for ultra-long contexts
Head Fusion	Merge similar attention heads' KV entries	Reduces cache size for GQA models
Entropy-Guided Caching	Allocate more cache to high-entropy (broadly attending) heads	Better quality per memory byte
Static KV Cache	Pre-allocate fixed-size cache	Enables torch.compile for up to 4x speedup
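To make the paging idea in the first row concrete, here is a toy block allocator in the spirit of PagedAttention. It is a sketch rather than vLLM's implementation: the class name, its methods, and the block size of 16 tokens are choices made for this example. Each sequence owns a block table of non-contiguous physical blocks, so the only waste is the unused tail of its last block.

```python
import math

class PagedKVAllocator:
    """Toy KV cache allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> tokens stored so far

    def append_tokens(self, seq_id, n_tokens):
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + n_tokens
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted")
        for _ in range(needed):
            table.append(self.free_blocks.pop())  # any free block will do
        self.lengths[seq_id] = new_len

    def free(self, seq_id):
        # Finished sequences return their blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.lengths.values())

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
alloc.append_tokens("a", 20)   # needs 2 blocks
alloc.append_tokens("b", 5)    # needs 1 block
print(alloc.block_tables, alloc.wasted_slots())
```

By contrast, reserving a contiguous region sized for the maximum sequence length would leave almost all of that region unused for short sequences like these.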

Speculative Decoding¶

Uses a small, fast draft model to generate candidate tokens, then verifies them in a single forward pass of the large model. Correct tokens are accepted for free.

graph LR
    A[Draft Model - 1B] -->|Generate 5 candidate tokens| B[Large Model - 70B]
    B -->|Verify in single pass| C{Accept / Reject}
    C -->|Accepted tokens| D[Output]
    C -->|Rejected| E[Revert to large model generation]

Typically 1.5–3x speedup with no quality loss
DeFT (ICLR 2025): Flash Tree-attention for tree-structured decoding, including speculative decoding; reports 2.2–3.6x speedup
Prompt Lookup Decoding: uses the prompt itself as the draft source
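The draft-then-verify loop can be sketched with toy deterministic "models". Everything here is an assumption made for illustration: the two next-token functions stand in for real networks, decoding is greedy, and the toy draft peeks at the target to control how often it is wrong. The point is that the output matches plain greedy decoding with the target while using far fewer target passes.

```python
VOCAB = 50

def target_next(seq):
    # Stand-in for the large model: a deterministic next-token rule.
    return (seq[-1] * 7 + len(seq)) % VOCAB

def draft_next(seq):
    # Stand-in for the draft model: wrong whenever the position is a
    # multiple of 4 (it calls the target only to keep the toy short).
    guess = target_next(seq)
    return guess if len(seq) % 4 else (guess + 1) % VOCAB

def greedy_decode(next_fn, prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_fn(tokens))
    return tokens

def speculative_decode(prompt, max_new_tokens, k=5):
    tokens, target_passes = list(prompt), 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every position; a real model scores all of
        #    them in one batched forward pass, counted once here.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if draft[i] != expected:
                break  # first miss: keep the target's token, drop the rest
        else:
            accepted.append(target_next(tokens + draft))  # bonus token
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes

out, passes = speculative_decode([1, 2, 3], max_new_tokens=20)
print(out == greedy_decode(target_next, [1, 2, 3], 20), passes)
```

Production systems that sample rather than decode greedily use a probabilistic acceptance rule instead of the exact-match check shown here.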

Flash Attention¶

Optimizes attention computation by minimizing GPU memory movement (HBM ↔ SRAM transfers). Standard attention materializes the full $N \times N$ attention matrix; Flash Attention tiles the computation to keep working data in fast SRAM.

Flash Attention 2: 2x faster than Flash Attention 1
Flash Attention 3 (July 2024): further optimizations for H100
FlashInfer (MLSys 2025): customizable attention engine with JIT compilation, integrated into SGLang, vLLM, and MLC-Engine
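The tiling trick rests on an online softmax: a running maximum and running sum let each block of keys be processed once, without ever holding all $N$ scores. The pure-Python sketch below shows that recurrence for a single query. It illustrates the arithmetic only, not a GPU kernel, and the function names are made up for this example.

```python
import math

def scaled_dot(q, k):
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def naive_attention(q, keys, values):
    # Materializes every score before normalizing.
    scores = [scaled_dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, block_size=2):
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scaled_dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [2.0, 0.0], [-1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

Both functions return the same vector up to floating-point rounding; only the tiled version keeps memory proportional to the block size.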

Batching Strategies¶

Strategy	Description	Best For
Static Batching	Fixed batch, all requests start/end together	Simple but wasteful
Continuous Batching	New requests join the batch as slots free up	Standard for production serving
Disaggregated Prefill/Decode	Separate GPU pools for prefill vs decode phases	Advanced; used by NVIDIA Dynamo
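A small simulation shows why continuous batching wins. The sketch assumes each request needs a fixed number of decode steps and occupies one batch slot; the function names and the workload are invented for illustration.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request finishes.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    queue, running, steps = deque(lengths), [], 0
    while queue or running:
        # Fill any free slots before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now and free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps

# Two long requests mixed with short ones, 4 slots
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, 4))
print(continuous_batching_steps(lengths, 4))
```

With static batching the short requests in each batch idle behind the long one; with continuous batching the second long request starts as soon as a slot frees up.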

Parameter-Efficient Fine-Tuning (PEFT)¶

Full fine-tuning updates all parameters — prohibitively expensive for large models. PEFT methods train <1% of parameters while retaining 90–95% of full fine-tuning quality.

LoRA (Low-Rank Adaptation)¶

Injects trainable low-rank matrices into each transformer layer while freezing original weights.

How it works:

For a weight matrix $W \in \mathbb{R}^{d \times k}$, instead of updating $W$ directly, LoRA adds:

$$ W' = W + \Delta W = W + BA $$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$ (typically 8–64).

Property	Value
Trainable params	0.2–0.3% of total
Adapter size	Few MB (vs GB for full model)
Inference cost	Zero — adapters merge into base weights
Task switching	Swap adapter files without reloading base model
Quality	Competitive with full fine-tuning for most tasks
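The update rule and the "zero inference cost" property can be checked with tiny integer matrices. This is a sketch with hand-picked values; it omits the `lora_alpha / r` scaling that libraries usually apply, to match the formula above, and the helper names are illustrative.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_param_counts(d, k, r):
    """(parameters in W, parameters in B and A) for one adapted matrix."""
    return d * k, r * (d + k)

# d=3, k=2, r=1
W = [[1, 2], [3, 4], [5, 6]]   # frozen base weight, d x k
B = [[1], [0], [2]]            # trainable, d x r
A = [[3, -1]]                  # trainable, r x k
x = [[1], [2]]                 # input column vector

adapter_path = matadd(matmul(W, x), matmul(B, matmul(A, x)))  # Wx + B(Ax)
merged_path = matmul(matadd(W, matmul(B, A)), x)              # (W + BA)x
print(adapter_path, merged_path)

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, f"{100 * lora / full:.2f}%")
```

The per-matrix ratio here is higher than the 0.2–0.3% in the table because that figure is relative to the whole model, where only some matrices receive adapters.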

QLoRA (Quantized LoRA)¶

Loads the base model in 4-bit NormalFloat4 quantization while training LoRA adapters in higher precision:

75–80% memory reduction vs 16-bit LoRA
Enables fine-tuning 65B models on a single 48GB GPU
Quality on par with full 16-bit fine-tuning in many cases

Key innovations:

NF4 data type: optimized for normally distributed weights
Double quantization: compresses scale/offset constants themselves
Paged optimizers: use NVIDIA unified memory to page optimizer states between GPU and CPU during memory spikes, avoiding OOM

Adapter Modules¶

Small feed-forward networks inserted after attention or FFN sublayers. Base model is frozen; only adapter weights train.

More modular than LoRA (can mix/match per task)
Slight inference overhead (adapters don't merge into base weights)
Useful for multi-task serving with shared base

When to Use Which¶

Method	Best For	Hardware
Full Fine-Tuning	Maximum quality, small models (<7B)	8x A100 or equivalent
LoRA	General fine-tuning, easy deployment	1–2x A100
QLoRA	Large models (13B–70B), limited VRAM	Single 24–48GB GPU
Adapters	Multi-task serving, modular systems	Similar to LoRA

Tooling¶

Hugging Face PEFT: canonical library; model.add_adapter() integrates with Transformers
bitsandbytes: 4-bit/8-bit quantization for QLoRA
Unsloth: 2x faster LoRA/QLoRA training with custom Triton kernels
Axolotl: config-driven fine-tuning framework wrapping multiple methods

Distributed Inference and Scaling¶

Parallelism Strategies¶

Strategy	What It Splits	When to Use
Tensor Parallelism (TP)	Individual layer weights across GPUs	Model too large for one GPU
Pipeline Parallelism (PP)	Sequential layers across GPUs/nodes	Multi-node deployment
Data Parallelism (DP)	Replicas of the full model	High throughput, model fits one GPU
Expert Parallelism (EP)	MoE experts across GPUs	MoE models with many experts

NVIDIA Dynamo¶

Announced at GTC 2025, Dynamo is a distributed inference orchestration layer on top of vLLM/TensorRT-LLM/SGLang:

Disaggregated prefill and decode: separate GPU pools optimized for each phase
Coordinates work across GPU pools
Smart request routing based on KV cache locality

llm-d (Kubernetes-Native)¶

Launched May 2025 by Red Hat, Google Cloud, IBM, NVIDIA, and CoreWeave:

Kubernetes-native distributed LLM serving
Disaggregated prefill/decode stages
Gateway API Inference Extension for routing
Dynamic Resource Allocation (DRA) for GPU scheduling

Multi-Model Routing¶

For production deployments with multiple models:

Tool	Purpose
LiteLLM	Unified API gateway for 100+ LLM providers; fallback routing
Envoy AI Gateway	Proxy-level routing, rate limiting, auth
OpenRouter	Third-party multi-model API with cost optimization

Retrieval-Augmented Generation (RAG)¶

RAG allows LLMs to access external knowledge at inference time, reducing hallucination and enabling domain-specific responses without retraining.

Architecture¶

graph LR
    A[User Query] --> B[Embedding Model]
    B --> C[Vector Search]
    D[Document Corpus] --> E[Chunking]
    E --> F[Embedding Model]
    F --> G[Vector Database]
    C --> G
    G --> H[Top-K Relevant Chunks]
    H --> I[Augmented Prompt]
    A --> I
    I --> J[LLM]
    J --> K[Grounded Response]

Pipeline Steps¶

Step	What Happens	Key Decisions
1. Document Preparation	Clean, parse, and normalize source documents	Format handling (PDF, HTML, markdown)
2. Chunking	Split documents into retrieval units	Chunk size (256–1024 tokens), overlap, strategy
3. Embedding	Convert chunks to dense vectors	Model choice (voyage-3-large, text-embedding-3-large)
4. Indexing	Store vectors in a vector database	Database choice (Pinecone, Weaviate, Milvus, Qdrant)
5. Retrieval	Find top-K chunks similar to the query	Similarity metric (cosine), hybrid search, K value
6. Reranking	Re-score retrieved chunks for relevance	Reranker choice: cross-encoder (Cohere Rerank, BGE) or late-interaction (ColBERT)
7. Augmentation	Inject chunks into the LLM prompt	Prompt template design, chunk ordering
8. Generation	LLM produces a grounded answer	Citation generation, faithfulness checking
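Steps 3 to 7 can be sketched end to end without any external service. In this illustration a bag-of-words counter stands in for the embedding model and a Python list stands in for the vector database; both are deliberate simplifications, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for a dense embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, retrieved):
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved, start=1))
    return ("Answer using only the context below and cite chunk numbers.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

chunks = [
    "vLLM uses PagedAttention to manage the KV cache in pages.",
    "QLoRA loads the base model in 4-bit NF4 and trains LoRA adapters.",
    "Ollama wraps llama.cpp with a CLI and REST API.",
]
query = "How does vLLM manage the KV cache?"
print(build_prompt(query, retrieve(query, chunks, k=2)))
```

The resulting string is what would be sent to the LLM in step 8.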

Chunking Strategies¶

Strategy	Description	Best For
Fixed-size	Split at N tokens with overlap	Simple, predictable
Recursive	Split by paragraph, then sentence, then word	General-purpose (LangChain default)
Semantic	Split when embedding similarity between adjacent segments drops	Better coherence; +9% recall over fixed
Document-structure	Split by headings, sections, markdown structure	Structured documents (docs, wikis)
Proposition-based	Extract atomic facts as individual chunks	Highest precision, expensive to compute
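The fixed-size strategy in the first row is the easiest to implement. A minimal sketch, assuming whitespace-separated words as a stand-in for model tokens (a real pipeline would use the embedding model's tokenizer):

```python
def chunk_fixed(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

words = "the kv cache grows with sequence length and batch size quickly".split()
for chunk in chunk_fixed(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap repeats the last token of each chunk at the start of the next, so a sentence split at a boundary still appears whole in at least one neighbor when the overlap is large enough.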

Vector Databases¶

Database	Strength	Scale	Managed Option
Pinecone	Zero-ops managed service	Billions	Yes (primary)
Weaviate	Hybrid search + knowledge graph	Millions–Billions	Yes
Milvus	Billion-scale, distributed	Billions	Zilliz Cloud
Qdrant	Complex metadata filtering	Millions–Billions	Yes
Chroma	Developer-friendly, lightweight	Millions	No (self-hosted)
pgvector	PostgreSQL extension; no new infra	Millions	Via any Postgres host

Hybrid Search and Reranking¶

Combining dense (semantic) and sparse (lexical/BM25) retrieval improves results — dense search handles paraphrases and meaning, while sparse search catches exact terms, acronyms, and domain jargon.

After initial retrieval, a cross-encoder reranker re-scores each (query, chunk) pair, adding 10–30% precision improvement at 50–100ms latency cost.
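One common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat RRF and the constant `k=60` as choices made for this sketch.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties are broken by document id to keep the output deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["a", "b", "c"]    # from vector search
sparse_hits = ["c", "a", "d"]   # from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Because RRF uses only ranks, it needs no score normalization between the two retrievers; the fused list is then what a reranker would re-score.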

RAG vs Fine-Tuning¶

Dimension	RAG	Fine-Tuning
Knowledge update	Instant (swap documents)	Requires retraining
Cost	Low (no training)	High (GPU hours)
Hallucination	Reduced (grounded in sources)	Can still hallucinate
Latency	Higher (retrieval + generation)	Lower (no retrieval step)
Best for	Factual Q&A, documents, knowledge bases	Style/format changes, task specialization

RAG can cut fine-tuning spend by 60–80% by delivering domain knowledge through retrieval rather than parameter updates.

Evaluation Benchmarks¶

Major Benchmarks¶

Benchmark	What It Tests	Format	Status (2025)
MMLU	General knowledge (57 subjects)	Multiple choice	Saturated; use MMLU-Pro
MMLU-Pro	Harder MMLU with more choices	Multiple choice	Current knowledge standard
HumanEval	Code generation (Python)	Generate code + unit tests	Top models >90%
SWE-bench	Real software engineering (bugs in repos)	Navigate codebase, write patches	Gold standard for coding
LiveCodeBench	Rolling code challenges	Competitive programming	Contamination-resistant
GSM8K	Grade school math (2–8 steps)	Chain-of-thought reasoning	Saturated; contamination concerns
MATH / MATH-500	Competition math (AMC/AIME)	Multi-step symbolic reasoning	DeepSeek-R1 at 97.3%
GPQA	Graduate-level science	Expert-written multiple choice	Hard frontier benchmark
HLE (Humanity's Last Exam)	2,500 expert questions	Multi-modal	Very hard; even frontier LLMs score low
Arena Elo (Chatbot Arena)	Human preference ranking	Blind A/B voting	Most trusted overall quality ranking
IFEval	Instruction following	Format/constraint verification	Tests structured output compliance

Benchmark Pitfalls¶

Use Benchmarks Carefully

Saturation: MMLU, GSM8K, HumanEval are largely solved — score differences at 90%+ are often noise
Contamination: up to 13% accuracy drops on GSM8K when removing training-overlap examples
Task mismatch: match benchmarks to your actual use case — MMLU ≠ code, HumanEval ≠ real engineering
Arena Elo: most trusted overall signal but biased toward chatbot-style tasks

Structured Output and Constrained Decoding¶

Ensuring LLMs produce valid JSON, function calls, or other structured formats.

Approaches¶

Method	Guarantee	How It Works
Prompt engineering	Best-effort	Describe desired format in prompt
JSON mode (API)	Valid JSON; schema conformance only with structured-output modes	Provider constrains output to valid JSON
Constrained decoding	100% conformant	Mask invalid tokens at each step using grammar/schema
Fine-tuning	High but not guaranteed	Train on structured input-output pairs

How Constrained Decoding Works¶

A logit processor sits between the model's output and sampling. It tracks position within the target grammar (JSON Schema, regex, EBNF) and sets invalid token logits to $-\infty$:

$$ P'(t) = \text{normalize}(P(t) \odot \text{mask}(t)) $$
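The masking step itself is a few lines. The sketch below applies a boolean mask to raw logits and renormalizes with a softmax. The four-token vocabulary and the hand-written masks are assumptions for illustration; a real logit processor derives the mask from the grammar state at each step.

```python
import math

def masked_distribution(logits, allowed):
    """Set disallowed logits to -inf, then softmax over what remains."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("grammar allows no token at this position")
    exps = [math.exp(l - top) for l in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["{", '"', "hello", "}"]
logits = [1.0, 0.5, 3.0, 0.2]

# At the start of a JSON object only "{" is valid.
print(masked_distribution(logits, [True, False, False, False]))
# Right after "{", either a key's opening quote or "}" is valid.
print(masked_distribution(logits, [False, True, False, True]))
```

Even though the model prefers "hello" by a wide margin, it can never be sampled while the mask excludes it.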

Key Libraries¶

Library	Approach	Overhead	Integration
XGrammar	Pushdown automaton; precomputed bitmasks	Near-zero (~1%)	vLLM, SGLang, MLC
llguidance (Microsoft)	Grammar-based; credited by OpenAI	Low	Guidance, OpenAI API
Outlines	Regex/CFG masking	Low-moderate	Hugging Face, vLLM
Guidance (Microsoft)	Template-based with token healing	Low	Standalone

vLLM (0.8.5+) supports structural tags — constrain only parts of the output. The model generates free-form text, switches into JSON-constrained tool calls, then back. Critical for agentic workflows.

Safety, Guardrails, and Content Filtering¶

Guardrail Frameworks¶

Framework	Provider	Approach
NeMo Guardrails	NVIDIA	Programmable rails via Colang language; input/output/dialog/retrieval rails
Llama Guard	Meta	LLM-based classifier; categorizes prompts as safe/unsafe
Guardrails AI	Open source	Validator pipeline; JSON schema, toxicity, PII redaction
Azure AI Content Safety	Microsoft	Cloud API; real-time content classification
Lakera	Third-party	Specialized prompt injection detection

Types of Rails¶

Rail Type	When	What
Input rails	Before LLM processes request	Reject harmful prompts, mask PII, detect injection
Output rails	After LLM generates response	Filter toxic content, validate format
Dialog rails	During multi-turn conversation	Enforce flow, prevent topic drift
Retrieval rails	In RAG pipelines	Filter harmful retrieved chunks

Prompt Injection (2025)¶

The top LLM security concern. 2025 research revealed:

Attack success rates of 72–92% against several guardrail systems
Emoji smuggling achieved 100% bypass rate
Multi-layered defense is the only viable approach

Defense in Depth

No single guardrail is sufficient. Combine: (1) input classification (Llama Guard), (2) output filtering (toxicity, PII), (3) rate limiting + anomaly detection, (4) human review for high-stakes decisions.

Additional PEFT Methods¶

Beyond LoRA, QLoRA, and adapters:

Method	How It Works	Trainable Params	Best For
DoRA (Weight-Decomposed Low-Rank Adaptation)	Decomposes weights into magnitude + direction; applies LoRA to the direction and trains the magnitude directly	~same as LoRA	Better quality at same rank; drop-in LoRA replacement
Prefix Tuning	Prepends trainable "virtual tokens" to each layer's K and V	~0.1%	Few-shot task adaptation
Prompt Tuning	Adds trainable embeddings to input only (not each layer)	~0.01%	Extremely lightweight; best for classification
IA3	Learns rescaling vectors for K, V, and FFN activations	~0.01%	Few-shot with minimal parameters

Fine-Tuning Data Preparation¶

Aspect	Recommendation
Format	JSONL with `instruction`/`input`/`output` (Alpaca), a `messages` array (OpenAI chat format), or a `conversations` array (ShareGPT)
Quality over quantity	1,000 high-quality examples often outperform 100,000 noisy ones
Diversity	Cover the range of expected inputs; avoid over-representing any pattern
Decontamination	Remove examples overlapping with evaluation benchmarks
Minimum size	LoRA/QLoRA: 500–5K for task-specific; 10K–100K for general instruction tuning
Validation split	Hold out 5–10%; monitor loss for overfitting
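Several of these recommendations can be enforced mechanically before training. The sketch below assumes Alpaca-style records and uses a simple normalized-text fingerprint for deduplication; the function names, the fingerprint, and the 10% split are illustrative choices rather than a prescribed pipeline.

```python
import json
import random

REQUIRED_KEYS = ("instruction", "output")

def load_examples(lines):
    """Parse JSONL lines, dropping incomplete records and exact duplicates."""
    seen, examples = set(), []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not all(record.get(key) for key in REQUIRED_KEYS):
            continue  # missing or empty instruction/output
        fingerprint = (record["instruction"].strip().lower(),
                       (record.get("input") or "").strip().lower())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        examples.append(record)
    return examples

def train_val_split(examples, val_fraction=0.1, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

raw = [
    '{"instruction": "Translate to French", "input": "cat", "output": "chat"}',
    '{"instruction": "translate to french ", "input": "Cat", "output": "chat"}',
    '{"instruction": "Summarize", "input": "long text", "output": ""}',
    '{"instruction": "Add", "input": "2+2", "output": "4"}',
]
print(len(load_examples(raw)))  # duplicate and empty-output rows are dropped
```

Benchmark decontamination needs the evaluation sets themselves and is not covered by this sketch.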

Fine-Tuning Pipeline¶

graph LR
    A[Collect/Generate Data] --> B[Format to JSONL]
    B --> C[Decontaminate & Deduplicate]
    C --> D[Train/Val Split]
    D --> E[Choose Method: LoRA / QLoRA / Full]
    E --> F[Train with Early Stopping]
    F --> G[Evaluate on Held-Out Set + Benchmarks]
    G --> H[Merge Adapter into Base]
    H --> I[Deploy]

Production Best Practices¶

Deployment Lifecycle¶

graph LR
    A[Prototype with Ollama] --> B[Validate with vLLM/SGLang]
    B --> C[Optimize: Quantization + Batching]
    C --> D[Load Test: Latency + Throughput]
    D --> E[Deploy: K8s + Autoscaling]
    E --> F[Monitor: Latency SLOs + Quality]

Checklist¶

Pre-Production Checklist

Model Selection

Benchmark candidate models on your actual task distribution
Test quantized variants (Q4_K_M, AWQ, FP8) against FP16 baseline
Validate edge cases: long inputs, multilingual, structured output

Infrastructure

Right-size GPU selection (H100 for throughput, A10G/L4 for cost, Apple Silicon for privacy)
Configure tensor parallelism if model exceeds single-GPU VRAM
Set up continuous batching with appropriate max batch size
Enable prefix caching for repetitive prompt patterns

Reliability

Deploy multiple replicas behind a load balancer
Configure autoscaling based on queue depth, not just CPU/GPU utilization
Set request timeouts and max token limits
Implement circuit breakers and fallback to smaller/cached models
Test failover by terminating instances under load

Monitoring

Track Time-to-First-Token (TTFT), tokens/second, and end-to-end latency at p50/p95/p99
Monitor GPU utilization, VRAM usage, and KV cache occupancy
Log prompt/response lengths for capacity planning
Set up alerts for latency SLO violations and OOM events

Quality

Implement output validation (JSON schema, safety filters)
Run periodic eval benchmarks against held-out test sets
Monitor for model drift after updates or quantization changes
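For the latency items in the checklist, percentile tracking is straightforward to prototype. This sketch uses the nearest-rank definition and a made-up SLO threshold; in production these numbers usually come from the metrics system rather than hand-rolled code.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(samples_ms, slo_p95_ms):
    report = {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
    report["slo_violated"] = report["p95"] > slo_p95_ms
    return report

ttft_ms = list(range(1, 101))  # synthetic TTFT samples: 1..100 ms
print(latency_report(ttft_ms, slo_p95_ms=90))  # 90 ms is a hypothetical SLO
```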

Cost Optimization¶

Technique	Savings	Tradeoff
Quantization (FP16 → INT4)	70–75% VRAM, 2x speed	Minor quality loss
Speculative decoding	1.5–3x throughput	Draft model complexity
Prefix caching	30–60% for repetitive prompts	Memory for cache storage
Request batching	3–10x throughput	Slightly higher latency
Spot/preemptible instances	60–80% compute cost	Requires graceful interruption handling
Model distillation	5–10x cheaper inference	Upfront distillation cost

Common Pitfalls¶

Avoid These

Over-quantizing for your use case: Q2/Q3 works for casual chat but breaks agentic workflows, JSON output, and code generation
Ignoring TTFT: users perceive time-to-first-token as "speed" more than tokens/second
Static batching in production: wastes GPU cycles waiting for the longest request in the batch
No fallback strategy: a single model endpoint is a single point of failure
Benchmarking with synthetic data: real traffic patterns (variable lengths, bursty arrivals) behave very differently from uniform benchmarks
Skipping load testing: KV cache OOM under concurrent load is the most common production failure

CLI Recipes¶

Ollama (Local Inference)¶

# Install
curl -fsSL https://ollama.com/install.sh | sh

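# Pull and run a model
ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M

# List models
ollama list

# Serve as API (skip if the installer already started the background service)
ollama serve  # default: http://localhost:11434

# From another terminal; "stream": false returns one JSON object instead of a stream
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b-instruct-q4_K_M","prompt":"Hello","stream":false}'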

vLLM (Production Serving)¶

# Install
pip install vllm

# Serve a model with tensor parallelism across 2 GPUs
# (add --quantization awq only when serving an AWQ-quantized checkpoint)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000

# OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct","messages":[{"role":"user","content":"Hello"}]}'
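The same endpoint can be called from Python with only the standard library. This is a minimal sketch: the helper names are made up, `my-model` is a placeholder for whatever model name was passed to `vllm serve`, and the response parsing assumes the usual OpenAI chat completion shape.

```python
import json
import urllib.request

def build_chat_request(base_url, model, user_message, max_tokens=128):
    """Build a POST request for an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(base_url, model, user_message):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Requires a running server, for example:
# print(chat("http://localhost:8000", "my-model", "Hello"))
```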

MLX LM (Apple Silicon)¶

# Install
pip install mlx-lm

# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Meta-Llama-3-8B \
  --quantize --q-bits 4 --mlx-path ./llama3-8b-4bit

# Generate from the converted model
mlx_lm.generate --model ./llama3-8b-4bit \
  --prompt "Explain transformers" --max-tokens 500

# Fine-tune with LoRA (--data is a directory containing train.jsonl and valid.jsonl)
mlx_lm.lora --model ./llama3-8b-4bit \
  --train --data ./data --iters 1000

llama.cpp (GGUF Inference)¶

# Build
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run inference
./build/bin/llama-cli -m ./models/llama3-8b-q4_K_M.gguf \
  -p "What is quantization?" -n 256

# Start API server
./build/bin/llama-server -m ./models/llama3-8b-q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99  # -ngl: layers offloaded to GPU

Quantization with llama.cpp¶

# Convert HF model to GGUF
python convert_hf_to_gguf.py ./models/llama3-8b/ --outfile llama3-8b-f16.gguf

# Quantize
./build/bin/llama-quantize llama3-8b-f16.gguf llama3-8b-q4_K_M.gguf Q4_K_M

Fine-Tuning with QLoRA (Hugging Face)¶

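import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # gated repo: accept the license on the Hub first

# 4-bit quantization config: NF4 weights, BF16 compute, double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepare the frozen 4-bit base for training (casts norms to FP32, enables input gradients)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,               # rank
    lora_alpha=32,      # scaling (effective scale = lora_alpha / r)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # roughly 0.2% of parameters are trainable

# Pass `model` and `tokenizer` to trl.SFTTrainer (or transformers.Trainer) to run training.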