LLM Operations

Deployment, serving engines, inference optimization, fine-tuning, distributed scaling, and production best practices for LLMs.


Serving Engines

vLLM

Open-source inference engine optimized for high throughput. Its core innovation is PagedAttention: earlier serving systems wasted 60–80% of KV cache memory on fragmentation and over-reservation, and PagedAttention cuts that waste to under 4%.

  • 14–24x higher throughput than Hugging Face Transformers
  • 2.2–3.5x higher throughput than early TGI
  • OpenAI-compatible API out of the box
  • Continuous batching, prefix caching, SLA-aware scheduling
  • Stripe: 73% inference cost reduction, 50M daily API calls on 1/3 the GPU fleet

Distributed parallelism in vLLM:

| Strategy | When to Use | Config |
| --- | --- | --- |
| Tensor Parallelism (TP) | Model too large for one GPU, fits one node | tensor_parallel_size=4 |
| Pipeline Parallelism (PP) | Model too large for one node | pipeline_parallel_size=N_nodes |
| TP + PP combined | Multi-node, large models | Set both parameters |

Default runtime: Ray for multi-node, Python multiprocessing for single-node.
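
The sizing rules in the table can be sketched as a small helper. This is a hypothetical function, not part of vLLM: the 0.7 headroom factor (space left for KV cache and activations) and the power-of-two rounding are assumptions. Only the returned key names match the vLLM parameters listed above.

```python
import math


def choose_parallelism(model_gb: float, gpu_gb: float, gpus_per_node: int,
                       headroom: float = 0.7) -> dict:
    """Pick TP/PP sizes so sharded weights fit within `headroom` of each GPU.

    Hypothetical helper: `headroom` is an assumed safety margin, not a vLLM default.
    """
    usable_gb = gpu_gb * headroom
    gpus_needed = max(1, math.ceil(model_gb / usable_gb))

    # Tensor parallelism stays inside one node. Round up to a power of two,
    # since attention heads usually need to divide evenly across TP ranks.
    tp = 1
    while tp < min(gpus_needed, gpus_per_node):
        tp *= 2
    tp = min(tp, gpus_per_node)

    # Pipeline parallelism adds stages once a single TP group is not enough.
    pp = max(1, math.ceil(gpus_needed / tp))
    return {"tensor_parallel_size": tp, "pipeline_parallel_size": pp}


# 70B at FP16 (~140 GB of weights) on 80 GB GPUs, 8 GPUs per node
print(choose_parallelism(140, 80, 8))
```

Treat the result as a starting point: the real limit depends on context length and batch size, which drive KV cache size.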

TensorRT-LLM

NVIDIA's specialized inference library. Uses CUDA graph optimizations, fused kernels, and Tensor Core acceleration.

  • H100 + FP8: >10,000 output tok/s at 64 concurrent requests, ~100ms TTFT
  • Requires upfront "engine build" step per model/GPU/precision configuration
  • Highest raw performance on NVIDIA hardware but complex to set up

SGLang

High-performance serving with RadixAttention for aggressive KV cache reuse across requests. Best for:

  • Agentic workflows with repeated prefixes
  • RAG systems with shared context
  • Multi-turn conversations

Ollama

Single-command local inference. Wraps llama.cpp with a clean CLI and REST API.

ollama pull llama3:8b-q4_K_M
ollama run llama3:8b-q4_K_M "What is attention?"

Best for: prototyping, local development, personal use. Not production-grade at scale.

LM Studio

GUI-based local inference for GGUF models on macOS/Windows/Linux. Download models from Hugging Face, run with one click. Similar audience as Ollama but with a visual interface.

Engine Selection Guide

| Scenario | Recommended Engine |
| --- | --- |
| Fast time-to-serve, OpenAI-compatible | vLLM |
| Absolute lowest latency on NVIDIA | TensorRT-LLM |
| Agentic / RAG with prefix sharing | SGLang |
| Long conversations | TGI v3 (prefix caching) |
| Local prototyping | Ollama or LM Studio |
| Apple Silicon | MLX LM or Ollama |

VRAM Estimation

Core Formula

$$ \text{VRAM}_{\text{total}} = \text{Weights} + \text{KV Cache} + \text{Activations} + \text{Overhead} $$

1. Model Weights

$$ \text{Weight Memory} = \text{Parameters} \times \text{Bytes per Parameter} $$

| Precision | Bytes/Param | 7B Model | 13B Model | 70B Model |
| --- | --- | --- | --- | --- |
| FP32 | 4 | 28 GB | 52 GB | 280 GB |
| FP16 / BF16 | 2 | 14 GB | 26 GB | 140 GB |
| INT8 | 1 | 7 GB | 13 GB | 70 GB |
| INT4 (Q4) | 0.5 | 3.5 GB | 6.5 GB | 35 GB |

Quick rule of thumb: ~2 GB per 1B parameters at FP16, ~0.5 GB per 1B at INT4.
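
The weight formula is simple enough to check in a few lines. A minimal sketch, assuming 1 GB = 10^9 bytes and ignoring the small per-block metadata that real quantization formats add; the function and dictionary names are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB: parameters (in billions) times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


for size in (7, 13, 70):
    row = {p: weight_memory_gb(size, p) for p in ("fp32", "fp16", "int8", "int4")}
    print(f"{size}B: {row}")
```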

2. KV Cache

The KV cache is the hidden memory monster. It scales linearly with sequence length, batch size, and number of layers:

$$ \text{KV Cache} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{batch} \times \text{bytes} $$

For LLaMA 3 70B with GQA (80 layers, 8 KV heads, 128 head dim): ~0.31 MB per token at BF16. Standard MHA would require ~2.5 MB per token (8x more).

KV Cache Gotcha

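A model that fits comfortably at 2K context may OOM at 32K. Each 1,000 tokens adds roughly 0.13 GB per sequence for a 7–8B model with GQA (about 0.5 GB with standard MHA, as in Llama 2 7B), and for a 70B model serving long contexts or large batches the KV cache can exceed the weight memory.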
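
The per-token figures can be reproduced directly from the KV cache formula. A minimal sketch: the layer and head counts are the LLaMA 3 70B values quoted above, the 64-head MHA comparison assumes one KV head per query head, and sizes are reported in MiB/GiB (powers of 1024).

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size in bytes; the leading 2 covers one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


MIB = 1024 ** 2
GIB = 1024 ** 3

per_token_gqa = kv_cache_bytes(80, 8, 128, seq_len=1)    # GQA: 8 KV heads
per_token_mha = kv_cache_bytes(80, 64, 128, seq_len=1)   # MHA: 64 KV heads (assumed)
print(f"GQA: {per_token_gqa / MIB:.2f} MiB/token, MHA: {per_token_mha / MIB:.2f} MiB/token")

# Eight concurrent 32K-token sequences at BF16
long_context = kv_cache_bytes(80, 8, 128, seq_len=32 * 1024, batch=8)
print(f"8 x 32K tokens: {long_context / GIB:.0f} GiB")
```

At that batch size the cache alone is larger than a 4-bit copy of the weights, which is why context length and concurrency need to be part of any sizing exercise.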

3. Activations and Overhead

  • Activations: intermediate tensors during the forward pass; typically 5–20% of weight memory for inference
  • Framework overhead: CUDA context, memory allocator, driver — 500 MB to 2 GB

Practical formula for inference:

$$ \text{VRAM}_{\text{inference}} \approx \text{Weight Memory} \times 1.2 + \text{KV Cache} $$
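
As a quick sanity check, the practical formula can be wrapped in a function and compared against a GPU's capacity. A sketch only: the 1.2 factor is the rule of thumb above, and `fits_on_gpu` is a hypothetical helper.

```python
def inference_vram_gb(params_billion, bytes_per_param, kv_cache_gb, overhead_factor=1.2):
    """Approximate inference VRAM: weights * 1.2 (activations + overhead) + KV cache."""
    weights_gb = params_billion * bytes_per_param
    return weights_gb * overhead_factor + kv_cache_gb


def fits_on_gpu(required_gb, gpu_gb):
    """Hypothetical helper: True if the estimate fits in the GPU's memory."""
    return required_gb <= gpu_gb


# 70B model at INT4 with a 10 GB KV cache budget
estimate = inference_vram_gb(70, 0.5, kv_cache_gb=10)
print(f"{estimate:.1f} GB needed; fits on one 80 GB GPU: {fits_on_gpu(estimate, 80)}")
```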

4. Training VRAM

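Training needs far more memory than inference because gradients and optimizer states are stored alongside the weights. With mixed-precision Adam this comes to roughly 16 bytes per parameter before activations, about 8x the FP16 weight memory used for inference:

| Component | Memory (FP16 mixed-precision training) |
| --- | --- |
| Model weights (FP16) | 2 bytes/param |
| Gradients (FP16) | 2 bytes/param |
| Master weights (FP32 copy) | 4 bytes/param |
| Optimizer states (Adam) | 8 bytes/param (2x FP32 moments) |
| Activations | Variable (depends on batch size, checkpointing) |
| Total (excluding activations) | ~16 GB per 1B params (rule of thumb) |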

QLoRA reduces this to ~1 GB per 1B params by quantizing weights to 4-bit and training only LoRA adapters.
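
The ~16 GB per 1B parameters rule of thumb can be reproduced with a short breakdown. This sketch assumes the usual mixed-precision Adam setup, in which an FP32 master copy of the weights is kept next to the FP16 weights and gradients; activations are excluded because they depend on batch size and checkpointing.

```python
# Bytes per parameter for mixed-precision Adam (assumed setup, activations excluded)
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "adam first moment (fp32)": 4,
    "adam second moment (fp32)": 4,
}


def training_memory_gb(params_billion):
    """Per-component training memory in GB for a model of the given size."""
    return {name: params_billion * b for name, b in BYTES_PER_PARAM.items()}


breakdown = training_memory_gb(7)
total = sum(breakdown.values())
for name, gb in breakdown.items():
    print(f"{name:28s} {gb:6.1f} GB")
print(f"{'total (7B model)':28s} {total:6.1f} GB")
```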

MoE Memory

All experts must reside in VRAM even though only top-K are active per token. DeepSeek-V3 (671B total) needs hundreds of GB even though only 37B params fire per token.


GPU Hardware Selection Guide

NVIDIA Data Center GPUs

| GPU | VRAM | Memory BW | FP16 TFLOPS | FP8 TFLOPS | Best For |
| --- | --- | --- | --- | --- | --- |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 989 | 1,979 | Production serving, training |
| H100 NVL | 94 GB HBM3 | 3.9 TB/s | 989 | 1,979 | Large model inference |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | N/A | Previous gen workhorse |
| A100 40GB | 40 GB HBM2 | 1.6 TB/s | 312 | N/A | Budget production |
| L40S | 48 GB GDDR6 | 864 GB/s | 362 | 733 | Inference-optimized, cost-effective |
| A10G | 24 GB GDDR6 | 600 GB/s | 125 | N/A | Cloud inference (AWS g5) |
| L4 | 24 GB GDDR6 | 300 GB/s | 121 | 242 | Edge/cost-sensitive inference |
| B200 | 192 GB HBM3e | 8.0 TB/s | ~2,250 | ~4,500 | Next-gen Blackwell; 2025+ |

Consumer GPUs

| GPU | VRAM | Memory BW | Practical Use |
| --- | --- | --- | --- |
| RTX 4090 | 24 GB GDDR6X | 1,008 GB/s | Best consumer GPU for local LLMs; runs 13B Q4 comfortably |
| RTX 4080 | 16 GB GDDR6X | 717 GB/s | 7B Q4 with room for KV cache |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | Popular for fine-tuning; great VRAM/dollar on used market |
| RTX 5090 | 32 GB GDDR7 | 1,790 GB/s | Best consumer option in 2025; runs 30B Q4 |

Apple Silicon

| Chip | Unified Memory | Memory BW | Notes |
| --- | --- | --- | --- |
| M4 Pro | 24–64 GB | 273 GB/s | Runs 8B BF16 or 30B Q4 |
| M4 Max | 36–128 GB | 546 GB/s | Runs 70B Q4 on 128GB config |
| M3 Ultra | 96–512 GB | 819 GB/s | Can run 405B Q4 or 671B MoE on 512GB |
| M5 | 16–32 GB | 153 GB/s | 19–27% faster than M4 for LLM inference |

Apple Silicon advantage: unified memory means most of the system RAM is available for model weights, not just dedicated VRAM. A 512GB M3 Ultra can hold models that would otherwise be sharded across six or more 80GB H100s, though at much lower memory bandwidth.

Which GPU for Which Task?

| Task | Recommended | Why |
| --- | --- | --- |
| Local chat (personal use) | RTX 4090, M4 Pro/Max, RTX 5090 | 24–32GB VRAM handles 7B–13B easily |
| Fine-tuning (QLoRA, 7B–13B) | RTX 3090/4090 (24GB) | Sufficient VRAM for QLoRA |
| Fine-tuning (QLoRA, 70B) | 2x A100 80GB or 1x H100 | 70B QLoRA needs ~80GB |
| Production serving (<13B) | L4 or A10G | Cost-effective for smaller models |
| Production serving (70B+) | 2–4x H100 with TP | Tensor parallelism across GPUs |
| Research / large-scale training | H100/B200 clusters | Maximum throughput |

Inference Optimization

KV Cache

During autoregressive generation, each new token's attention computation requires the Keys and Values of all previous tokens. The KV cache stores these to avoid redundant recomputation.

Problem: KV cache grows linearly with sequence length and batch size, becoming the primary memory bottleneck for long-context inference.

Optimization techniques:

| Technique | Description | Impact |
| --- | --- | --- |
| PagedAttention (vLLM) | Manages KV cache in non-contiguous pages, like OS virtual memory | Cuts KV cache waste from 60–80% to under 4% |
| KV Cache Quantization | Compress KV cache to INT4/INT2/FP4 | NVIDIA NVFP4: <1% accuracy loss vs BF16 |
| Token Pruning | Evict low-attention tokens from cache | Reduces memory for ultra-long contexts |
| Head Fusion | Merge similar attention heads' KV entries | Reduces cache size for GQA models |
| Entropy-Guided Caching | Allocate more cache to high-entropy (broadly attending) heads | Better quality per memory byte |
| Static KV Cache | Pre-allocate fixed-size cache | Enables torch.compile for up to 4x speedup |
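
To see why paging helps, compare reserving a contiguous max-length slot per request with handing out small fixed-size blocks on demand. This is a toy accounting model, not vLLM's allocator; the 2,048-token reservation, the 16-token block size, and the request lengths are assumptions chosen for illustration.

```python
import math


def contiguous_waste(seq_lens, max_len):
    """Fraction of reserved token slots unused when every request reserves max_len."""
    reserved = len(seq_lens) * max_len
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens, block_size=16):
    """Fraction unused when memory is allocated in fixed-size blocks as sequences grow."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lens)
    reserved = blocks * block_size
    return 1 - sum(seq_lens) / reserved


lens = [100, 500, 1200, 60, 300]  # actual tokens used by five requests
print(f"contiguous reservation: {contiguous_waste(lens, max_len=2048):.1%} wasted")
print(f"paged allocation:       {paged_waste(lens):.1%} wasted")
```

With paging, waste is bounded by at most one partially filled block per sequence, regardless of how long the sequence could have grown.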

Speculative Decoding

Uses a small, fast draft model to generate candidate tokens, then verifies them in a single forward pass of the large model. Correct tokens are accepted for free.

graph LR
    A[Draft Model - 1B] -->|Generate 5 candidate tokens| B[Large Model - 70B]
    B -->|Verify in single pass| C{Accept / Reject}
    C -->|Accepted tokens| D[Output]
    C -->|Rejected| E[Revert to large model generation]

  • Typically 1.5–3x speedup with no quality loss
  • DeFT (ICLR 2025): Flash Tree-attention for tree-structured decoding, including speculative decoding; reports up to 2.2x end-to-end and 3.6x attention-latency speedup
  • Prompt Lookup Decoding: uses the prompt itself as the draft source
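
The draft-then-verify loop can be sketched with toy models. This is the greedy variant, where a draft token is accepted only if it equals the target model's own greedy choice, so the output is identical to running the target alone. The two `*_next` functions are hypothetical stand-ins for real models, and a real engine verifies all draft positions in one batched forward pass rather than a Python loop.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=5):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every position; counted as ONE pass because a
        #    real engine scores all k+1 positions in a single forward pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's token replaces the miss
                break
        else:
            accepted.append(target_next(tokens + draft))  # bonus token
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def target_next(seq):            # toy target: counts upward modulo 10
    return (seq[-1] + 1) % 10


def draft_next(seq):             # toy draft: same, but wrong after a 7
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


out, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
print(out, f"target passes: {passes} (vs 12 without speculation)")
```

With sampling instead of greedy decoding, the equality check is replaced by a rejection-sampling step that preserves the target distribution.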

Flash Attention

Optimizes attention computation by minimizing GPU memory movement (HBM ↔ SRAM transfers). Standard attention materializes the full $N \times N$ attention matrix; Flash Attention tiles the computation to keep working data in fast SRAM.

  • Flash Attention 2: 2x faster than Flash Attention 1
  • Flash Attention 3 (July 2024): further optimizations for H100
  • FlashInfer (MLSys 2025): customizable attention engine with JIT compilation, integrated into SGLang, vLLM, and MLC-Engine
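
The tiling idea rests on an online softmax: keys and values are processed one tile at a time while a running maximum, denominator, and output accumulator are rescaled, so the full row of scores never has to be stored. The sketch below shows this for a single query vector in pure Python. It illustrates the arithmetic only and says nothing about the actual CUDA kernels; the tile size and test vectors are arbitrary.

```python
import math


def _score(q, k):
    # Scaled dot product, as in standard attention
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = [_score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2):
    """Same result, visiting keys/values tile by tile with running statistics."""
    m = -math.inf                    # running max of scores seen so far
    z = 0.0                          # running softmax denominator
    acc = [0.0] * len(values[0])     # running unnormalized output
    for start in range(0, len(keys), tile):
        scores = [_score(q, k) for k in keys[start:start + tile]]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)        # 0.0 on the first tile
        z *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```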

Batching Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| Static Batching | Fixed batch, all requests start/end together | Simple but wasteful |
| Continuous Batching | New requests join the batch as slots free up | Standard for production serving |
| Disaggregated Prefill/Decode | Separate GPU pools for prefill vs decode phases | Advanced; used by NVIDIA Dynamo |
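
A toy simulation makes the difference between the first two strategies concrete. It counts decode steps only, assumes every active request produces one token per step, and ignores prefill cost; the output lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    active = []          # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1       # one decode step for every active request
        active = [r - 1 for r in active if r > 1]
    return steps


output_lengths = [10, 200, 15, 20, 180, 12, 25, 30]
print("static:    ", static_batching_steps(output_lengths, batch_size=4))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=4))
```

Under static batching the short requests sit idle behind the 200-token one; continuous batching reuses their slots as soon as they finish.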

Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning updates all parameters — prohibitively expensive for large models. PEFT methods train <1% of parameters while retaining 90–95% of full fine-tuning quality.

LoRA (Low-Rank Adaptation)

Injects trainable low-rank matrices into each transformer layer while freezing original weights.

How it works:

For a weight matrix $W \in \mathbb{R}^{d \times k}$, instead of updating $W$ directly, LoRA adds:

$$ W' = W + \Delta W = W + BA $$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$ (typically 8–64).
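
Two consequences of this definition are easy to verify numerically: the adapter has $r(d + k)$ parameters instead of $dk$, and applying $W + BA$ gives the same result as applying $W$ and $BA$ separately, which is why a trained adapter can be merged into the base weights. A minimal sketch with tiny integer matrices; the usual $\alpha / r$ scaling factor is omitted for brevity.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    return d * k, r * (d + k)


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")

W = [[1, 2], [3, 4], [5, 6]]   # d=3, k=2 (frozen base weights)
B = [[1], [0], [2]]            # d x r, with r=1
A = [[1, -1]]                  # r x k
x = [[1], [2]]                 # input column vector

merged = matmul(add(W, matmul(B, A)), x)                # (W + BA) x
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))   # W x + B (A x)
print(merged, unmerged)
```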

| Property | Value |
| --- | --- |
| Trainable params | 0.2–0.3% of total |
| Adapter size | Few MB (vs GB for full model) |
| Inference cost | Zero once adapters are merged into the base weights |
| Task switching | Swap adapter files without reloading base model |
| Quality | Competitive with full fine-tuning for most tasks |

QLoRA (Quantized LoRA)

Loads the base model in 4-bit NormalFloat4 quantization while training LoRA adapters in higher precision:

  • 75–80% memory reduction vs 16-bit LoRA
  • Enables fine-tuning 65B models on a single 48GB GPU
  • Quality on par with full 16-bit fine-tuning in many cases

Key innovations:

  • NF4 data type: optimized for normally distributed weights
  • Double quantization: compresses scale/offset constants themselves
  • Paged optimizers: use NVIDIA unified memory to page optimizer states between GPU and CPU during memory spikes, avoiding OOM

Adapter Modules

Small feed-forward networks inserted after attention or FFN sublayers. Base model is frozen; only adapter weights train.

  • More modular than LoRA (can mix/match per task)
  • Slight inference overhead (adapters don't merge into base weights)
  • Useful for multi-task serving with shared base

When to Use Which

| Method | Best For | Hardware |
| --- | --- | --- |
| Full Fine-Tuning | Maximum quality, small models (<7B) | 8x A100 or equivalent |
| LoRA | General fine-tuning, easy deployment | 1–2x A100 |
| QLoRA | Large models (13B–70B), limited VRAM | Single 24–48GB GPU |
| Adapters | Multi-task serving, modular systems | Similar to LoRA |

Tooling

  • Hugging Face PEFT: canonical library; model.add_adapter() integrates with Transformers
  • bitsandbytes: 4-bit/8-bit quantization for QLoRA
  • Unsloth: 2x faster LoRA/QLoRA training with custom Triton kernels
  • Axolotl: config-driven fine-tuning framework wrapping multiple methods

Distributed Inference and Scaling

Parallelism Strategies

| Strategy | What It Splits | When to Use |
| --- | --- | --- |
| Tensor Parallelism (TP) | Individual layer weights across GPUs | Model too large for one GPU |
| Pipeline Parallelism (PP) | Sequential layers across GPUs/nodes | Multi-node deployment |
| Data Parallelism (DP) | Replicas of the full model | High throughput, model fits one GPU |
| Expert Parallelism (EP) | MoE experts across GPUs | MoE models with many experts |

NVIDIA Dynamo

Announced at GTC 2025, Dynamo is a distributed inference orchestration layer on top of vLLM/TensorRT-LLM/SGLang:

  • Disaggregated prefill and decode: separate GPU pools optimized for each phase
  • Coordinates work across GPU pools
  • Smart request routing based on KV cache locality

llm-d (Kubernetes-Native)

Launched May 2025 by Red Hat, Google Cloud, IBM, NVIDIA, and CoreWeave:

  • Kubernetes-native distributed LLM serving
  • Disaggregated prefill/decode stages
  • Gateway API Inference Extension for routing
  • Dynamic Resource Allocation (DRA) for GPU scheduling

Multi-Model Routing

For production deployments with multiple models:

| Tool | Purpose |
| --- | --- |
| LiteLLM | Unified API gateway for 100+ LLM providers; fallback routing |
| Envoy AI Gateway | Proxy-level routing, rate limiting, auth |
| OpenRouter | Third-party multi-model API with cost optimization |

Retrieval-Augmented Generation (RAG)

RAG allows LLMs to access external knowledge at inference time, reducing hallucination and enabling domain-specific responses without retraining.

Architecture

graph LR
    A[User Query] --> B[Embedding Model]
    B --> C[Vector Search]
    D[Document Corpus] --> E[Chunking]
    E --> F[Embedding Model]
    F --> G[Vector Database]
    C --> G
    G --> H[Top-K Relevant Chunks]
    H --> I[Augmented Prompt]
    A --> I
    I --> J[LLM]
    J --> K[Grounded Response]

Pipeline Steps

| Step | What Happens | Key Decisions |
| --- | --- | --- |
| 1. Document Preparation | Clean, parse, and normalize source documents | Format handling (PDF, HTML, markdown) |
| 2. Chunking | Split documents into retrieval units | Chunk size (256–1024 tokens), overlap, strategy |
| 3. Embedding | Convert chunks to dense vectors | Model choice (voyage-3-large, text-embedding-3-large) |
| 4. Indexing | Store vectors in a vector database | Database choice (Pinecone, Weaviate, Milvus, Qdrant) |
| 5. Retrieval | Find top-K chunks similar to the query | Similarity metric (cosine), hybrid search, K value |
| 6. Reranking | Re-score retrieved chunks for relevance | Cross-encoder reranker (Cohere, BGE, ColBERT) |
| 7. Augmentation | Inject chunks into the LLM prompt | Prompt template design, chunk ordering |
| 8. Generation | LLM produces a grounded answer | Citation generation, faithfulness checking |

Chunking Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed-size | Split at N tokens with overlap | Simple, predictable |
| Recursive | Split by paragraph, then sentence, then word | General-purpose (LangChain default) |
| Semantic | Split when embedding similarity between adjacent segments drops | Better coherence; +9% recall over fixed |
| Document-structure | Split by headings, sections, markdown structure | Structured documents (docs, wikis) |
| Proposition-based | Extract atomic facts as individual chunks | Highest precision, expensive to compute |
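
The fixed-size strategy is the easiest to implement and a useful baseline for the others. A minimal sketch: whitespace-separated words stand in for tokens here, which is an assumption; a real pipeline would use the embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into chunks of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


text = "the quick brown fox jumps over the lazy dog again"
for chunk in chunk_tokens(text.split(), chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of storing some tokens twice.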

Vector Databases

| Database | Strength | Scale | Managed Option |
| --- | --- | --- | --- |
| Pinecone | Zero-ops managed service | Billions | Yes (primary) |
| Weaviate | Hybrid search + knowledge graph | Millions–Billions | Yes |
| Milvus | Billion-scale, distributed | Billions | Zilliz Cloud |
| Qdrant | Complex metadata filtering | Millions–Billions | Yes |
| Chroma | Developer-friendly, lightweight | Millions | No (self-hosted) |
| pgvector | PostgreSQL extension; no new infra | Millions | Via any Postgres host |

Hybrid Search and Reranking

Combining dense (semantic) and sparse (lexical/BM25) retrieval improves results — dense search handles paraphrases and meaning, while sparse search catches exact terms, acronyms, and domain jargon.

After initial retrieval, a cross-encoder reranker re-scores each (query, chunk) pair, adding 10–30% precision improvement at 50–100ms latency cost.
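
One common way to combine the dense and sparse result lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. The text above does not prescribe a fusion method, so treat this as one option among several; the constant `k=60` is the conventional default and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["d3", "d1", "d7", "d2"]    # from vector search
sparse_hits = ["d1", "d9", "d3", "d5"]   # from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents found by both retrievers rise to the top, and the fused list is what would then be passed to the reranker.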

RAG vs Fine-Tuning

| Dimension | RAG | Fine-Tuning |
| --- | --- | --- |
| Knowledge update | Instant (swap documents) | Requires retraining |
| Cost | Low (no training) | High (GPU hours) |
| Hallucination | Reduced (grounded in sources) | Can still hallucinate |
| Latency | Higher (retrieval + generation) | Lower (single forward pass) |
| Best for | Factual Q&A, documents, knowledge bases | Style/format changes, task specialization |

RAG can cut fine-tuning spend by 60–80% by delivering domain knowledge through retrieval rather than parameter updates.


Evaluation Benchmarks

Major Benchmarks

| Benchmark | What It Tests | Format | Status (2025) |
| --- | --- | --- | --- |
| MMLU | General knowledge (57 subjects) | Multiple choice | Saturated; use MMLU-Pro |
| MMLU-Pro | Harder MMLU with more choices | Multiple choice | Current knowledge standard |
| HumanEval | Code generation (Python) | Generate code + unit tests | Top models >90% |
| SWE-bench | Real software engineering (bugs in repos) | Navigate codebase, write patches | Gold standard for coding |
| LiveCodeBench | Rolling code challenges | Competitive programming | Contamination-resistant |
| GSM8K | Grade school math (2–8 steps) | Chain-of-thought reasoning | Saturated; contamination concerns |
| MATH / MATH-500 | Competition math (AMC/AIME) | Multi-step symbolic reasoning | DeepSeek-R1 at 97.3% |
| GPQA | Graduate-level science | Expert-written multiple choice | Hard frontier benchmark |
| HLE (Humanity's Last Exam) | 2,500 expert questions | Multi-modal | Very hard; even frontier LLMs score low |
| Arena Elo (Chatbot Arena) | Human preference ranking | Blind A/B voting | Most trusted overall quality ranking |
| IFEval | Instruction following | Format/constraint verification | Tests structured output compliance |

Benchmark Pitfalls

Use Benchmarks Carefully

  • Saturation: MMLU, GSM8K, HumanEval are largely solved — score differences at 90%+ are often noise
  • Contamination: up to 13% accuracy drops on GSM8K when removing training-overlap examples
  • Task mismatch: match benchmarks to your actual use case — MMLU ≠ code, HumanEval ≠ real engineering
  • Arena Elo: most trusted overall signal but biased toward chatbot-style tasks

Structured Output and Constrained Decoding

Ensuring LLMs produce valid JSON, function calls, or other structured formats.

Approaches

| Method | Guarantee | How It Works |
| --- | --- | --- |
| Prompt engineering | Best-effort | Describe desired format in prompt |
| JSON mode (API) | Valid JSON; schema conformance only with schema-based structured output | Provider constrains output to valid JSON |
| Constrained decoding | 100% conformant | Mask invalid tokens at each step using grammar/schema |
| Fine-tuning | High but not guaranteed | Train on structured input-output pairs |

How Constrained Decoding Works

A logit processor sits between the model's output and sampling. It tracks position within the target grammar (JSON Schema, regex, EBNF) and sets invalid token logits to $-\infty$:

$$ P'(t) = \text{normalize}(P(t) \odot \text{mask}(t)) $$
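
The masking step can be shown on a five-token vocabulary. Setting a logit to $-\infty$ before the softmax is equivalent to zeroing its probability and renormalizing, as in the formula above. The vocabulary, logits, and the hand-written mask (which tokens are legal right after an opening `{`) are illustrative assumptions; a real implementation derives the mask from the grammar state.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over logits with disallowed tokens forced to probability 0."""
    if not any(allowed):
        raise ValueError("grammar allows no token at this position")
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]   # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]


vocab = ["{", '"', "hello", "}", "1"]
logits = [2.0, 1.0, 3.0, 0.5, 0.1]
# After an opening brace, a JSON object may continue only with a key quote or "}"
allowed = [False, True, False, True, False]

probs = masked_softmax(logits, allowed)
for tok, p in zip(vocab, probs):
    print(f"{tok!r:8} {p:.3f}")
```

The model's favorite token, `hello`, is impossible here, so sampling can only pick a token that keeps the output valid.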

Key Libraries

| Library | Approach | Overhead | Integration |
| --- | --- | --- | --- |
| XGrammar | Pushdown automaton; precomputed bitmasks | Near-zero (~1%) | vLLM, SGLang, MLC |
| llguidance (Microsoft) | Grammar-based; credited by OpenAI | Low | Guidance, OpenAI API |
| Outlines | Regex/CFG masking | Low-moderate | Hugging Face, vLLM |
| Guidance (Microsoft) | Template-based with token healing | Low | Standalone |

vLLM (0.8.5+) supports structural tags — constrain only parts of the output. The model generates free-form text, switches into JSON-constrained tool calls, then back. Critical for agentic workflows.


Safety, Guardrails, and Content Filtering

Guardrail Frameworks

| Framework | Provider | Approach |
| --- | --- | --- |
| NeMo Guardrails | NVIDIA | Programmable rails via Colang language; input/output/dialog/retrieval rails |
| Llama Guard | Meta | LLM-based classifier; categorizes prompts as safe/unsafe |
| Guardrails AI | Open source | Validator pipeline; JSON schema, toxicity, PII redaction |
| Azure AI Content Safety | Microsoft | Cloud API; real-time content classification |
| Lakera | Third-party | Specialized prompt injection detection |

Types of Rails

| Rail Type | When | What |
| --- | --- | --- |
| Input rails | Before LLM processes request | Reject harmful prompts, mask PII, detect injection |
| Output rails | After LLM generates response | Filter toxic content, validate format |
| Dialog rails | During multi-turn conversation | Enforce flow, prevent topic drift |
| Retrieval rails | In RAG pipelines | Filter harmful retrieved chunks |

Prompt Injection (2025)

The top LLM security concern. 2025 research revealed:

  • Attack success rates of 72–92% against several guardrail systems
  • Emoji smuggling achieved 100% bypass rate
  • Multi-layered defense is the only viable approach

Defense in Depth

No single guardrail is sufficient. Combine: (1) input classification (Llama Guard), (2) output filtering (toxicity, PII), (3) rate limiting + anomaly detection, (4) human review for high-stakes decisions.


Additional PEFT Methods

Beyond LoRA, QLoRA, and adapters:

| Method | How It Works | Trainable Params | Best For |
| --- | --- | --- | --- |
| DoRA (Weight-Decomposed Low-Rank Adaptation) | Decomposes weights into magnitude + direction; LoRA on direction only | ~same as LoRA | Better quality at same rank; drop-in LoRA replacement |
| Prefix Tuning | Prepends trainable "virtual tokens" to each layer's K and V | ~0.1% | Few-shot task adaptation |
| Prompt Tuning | Adds trainable embeddings to input only (not each layer) | ~0.01% | Extremely lightweight; best for classification |
| IA3 | Learns rescaling vectors for K, V, and FFN activations | ~0.01% | Few-shot with minimal parameters |

Fine-Tuning Data Preparation

| Aspect | Recommendation |
| --- | --- |
| Format | JSONL with instruction/input/output (Alpaca) or messages array (ChatML/ShareGPT) |
| Quality over quantity | 1,000 high-quality examples often outperform 100,000 noisy ones |
| Diversity | Cover the range of expected inputs; avoid over-representing any pattern |
| Decontamination | Remove examples overlapping with evaluation benchmarks |
| Minimum size | LoRA/QLoRA: 500–5K for task-specific; 10K–100K for general instruction tuning |
| Validation split | Hold out 5–10%; monitor loss for overfitting |
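
A small standard-library script covers three of these recommendations: the messages-array JSONL format, exact-duplicate removal, and a held-out validation split. It is a sketch under simple assumptions: the example records are synthetic placeholders, deduplication is exact-match only, and `prepare_split` is a hypothetical helper name.

```python
import json
import random


def prepare_split(examples, val_fraction=0.1, seed=0):
    """Drop exact duplicates, shuffle deterministically, return (train, val)."""
    seen, unique = set(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)
    n_val = max(1, round(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]


def to_jsonl(examples):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


# Synthetic placeholder data in the messages-array format
examples = [
    {"messages": [{"role": "user", "content": f"question {i}"},
                  {"role": "assistant", "content": f"answer {i}"}]}
    for i in range(20)
]
examples.append(examples[0])  # an accidental duplicate

train, val = prepare_split(examples)
print(len(train), "train /", len(val), "validation")
print(to_jsonl(val[:1]))
```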

Fine-Tuning Pipeline

graph LR
    A[Collect/Generate Data] --> B[Format to JSONL]
    B --> C[Decontaminate & Deduplicate]
    C --> D[Train/Val Split]
    D --> E[Choose Method: LoRA / QLoRA / Full]
    E --> F[Train with Early Stopping]
    F --> G[Evaluate on Held-Out Set + Benchmarks]
    G --> H[Merge Adapter into Base]
    H --> I[Deploy]

Production Best Practices

Deployment Lifecycle

graph LR
    A[Prototype with Ollama] --> B[Validate with vLLM/SGLang]
    B --> C[Optimize: Quantization + Batching]
    C --> D[Load Test: Latency + Throughput]
    D --> E[Deploy: K8s + Autoscaling]
    E --> F[Monitor: Latency SLOs + Quality]

Checklist

Pre-Production Checklist

Model Selection

  • Benchmark candidate models on your actual task distribution
  • Test quantized variants (Q4_K_M, AWQ, FP8) against FP16 baseline
  • Validate edge cases: long inputs, multilingual, structured output

Infrastructure

  • Right-size GPU selection (H100 for throughput, A10G/L4 for cost, Apple Silicon for privacy)
  • Configure tensor parallelism if model exceeds single-GPU VRAM
  • Set up continuous batching with appropriate max batch size
  • Enable prefix caching for repetitive prompt patterns

Reliability

  • Deploy multiple replicas behind a load balancer
  • Configure autoscaling based on queue depth, not just CPU/GPU utilization
  • Set request timeouts and max token limits
  • Implement circuit breakers and fallback to smaller/cached models
  • Test failover by terminating instances under load

Monitoring

  • Track Time-to-First-Token (TTFT), tokens/second, and end-to-end latency at p50/p95/p99
  • Monitor GPU utilization, VRAM usage, and KV cache occupancy
  • Log prompt/response lengths for capacity planning
  • Set up alerts for latency SLO violations and OOM events
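
Percentile tracking and an SLO check need nothing beyond the standard library. A minimal sketch: the sample values and the 500 ms p99 threshold are made up, and production systems would normally compute these in the metrics backend rather than in application code.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50/p95/p99 using linear interpolation between observed samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def violates_slo(samples_ms, p99_budget_ms):
    """True if the observed p99 latency exceeds the budget."""
    return latency_percentiles(samples_ms)["p99"] > p99_budget_ms


ttft_ms = list(range(1, 102))  # synthetic TTFT samples: 1 ms .. 101 ms
print(latency_percentiles(ttft_ms))
print("SLO violated:", violates_slo(ttft_ms, p99_budget_ms=500))
```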

Quality

  • Implement output validation (JSON schema, safety filters)
  • Run periodic eval benchmarks against held-out test sets
  • Monitor for model drift after updates or quantization changes

Cost Optimization

| Technique | Savings | Tradeoff |
| --- | --- | --- |
| Quantization (FP16 → INT4) | 70–75% VRAM, 2x speed | Minor quality loss |
| Speculative decoding | 1.5–3x throughput | Draft model complexity |
| Prefix caching | 30–60% for repetitive prompts | Memory for cache storage |
| Request batching | 3–10x throughput | Slightly higher latency |
| Spot/preemptible instances | 60–80% compute cost | Requires graceful interruption handling |
| Model distillation | 5–10x cheaper inference | Upfront distillation cost |

Common Pitfalls

Avoid These

  • Over-quantizing for your use case: Q2/Q3 works for casual chat but breaks agentic workflows, JSON output, and code generation
  • Ignoring TTFT: users perceive time-to-first-token as "speed" more than tokens/second
  • Static batching in production: wastes GPU cycles waiting for the longest request in the batch
  • No fallback strategy: a single model endpoint is a single point of failure
  • Benchmarking with synthetic data: real traffic patterns (variable lengths, bursty arrivals) behave very differently from uniform benchmarks
  • Skipping load testing: KV cache OOM under concurrent load is the most common production failure

CLI Recipes

Ollama (Local Inference)

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3:8b-q4_K_M
ollama run llama3:8b-q4_K_M

# List models
ollama list

# Serve as API
ollama serve  # default: http://localhost:11434
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b-q4_K_M","prompt":"Hello"}'

vLLM (Production Serving)

# Install
pip install vllm

# Serve a model with tensor parallelism (needs 2 GPUs)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000
# To serve quantized weights, point at an AWQ checkpoint and add: --quantization awq

# OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct","messages":[{"role":"user","content":"Hello"}]}'
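
The same endpoint can be called from Python with only the standard library. A sketch, assuming the server from the recipe above is running locally; `build_chat_request` and `chat` are illustrative helper names, and the model argument must match the name passed to `vllm serve`.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("http://localhost:8000", "<model name given to vllm serve>", "Hello"))
```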

MLX LM (Apple Silicon)

# Install
pip install mlx-lm

# Download and convert to 4-bit
mlx_lm.convert --hf-path meta-llama/Meta-Llama-3-8B \
  --quantize --q-bits 4 --mlx-path ./llama3-8b-4bit

# Generate with the converted model
mlx_lm.generate --model ./llama3-8b-4bit \
  --prompt "Explain transformers" --max-tokens 500

# Fine-tune with LoRA (--data is a directory containing train.jsonl and valid.jsonl)
mlx_lm.lora --model ./llama3-8b-4bit \
  --train --data ./data --iters 1000

llama.cpp (GGUF Inference)

# Build
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release

# Run inference
./build/bin/llama-cli -m ./models/llama3-8b-q4_K_M.gguf \
  -p "What is quantization?" -n 256

# Start API server
./build/bin/llama-server -m ./models/llama3-8b-q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99  # -ngl: layers offloaded to GPU

Quantization with llama.cpp

# Convert HF model to GGUF
python convert_hf_to_gguf.py ./models/llama3-8b/ --outfile llama3-8b-f16.gguf

# Quantize
./build/bin/llama-quantize llama3-8b-f16.gguf llama3-8b-q4_K_M.gguf Q4_K_M

Fine-Tuning with QLoRA (Hugging Face)

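import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # gated repo; requires an accepted license

# 4-bit NF4 quantization with double quantization; matmuls run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepares the quantized model for training (e.g. enables input gradients)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,               # rank
    lora_alpha=32,      # scaling (effective scale = lora_alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # roughly 0.2% of parameters are trainable
# Pass `model` and `tokenizer` to trl's SFTTrainer to run the training loop.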