LLM Fundamentals
A comprehensive reference covering how Large Language Models work — from transformer internals and training pipelines to quantization formats, model serving, and production deployment.
Summary
Large Language Models (LLMs) are neural networks trained on internet-scale text data to predict the next token in a sequence. Built on the transformer architecture (Vaswani et al., 2017), they use self-attention to weigh the relationship between every pair of tokens in a sequence in parallel, rather than processing tokens one at a time. When scaled to billions of parameters and trillions of training tokens, capabilities such as reasoning, code generation, and multi-step planning emerge.
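The self-attention step described above can be sketched in plain Python. This is a minimal single-head illustration, not how any production model is implemented: the function names (`softmax`, `matmul`, `self_attention`), the toy inputs, and the identity projection matrices are all assumptions made for this example. The `causal` flag reflects that decoder-only LLMs let each token attend only to itself and earlier positions.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v, causal=True):
    """Single-head scaled dot-product self-attention (illustrative only).

    x:             token vectors, shape (seq_len, d_model)
    w_q, w_k, w_v: projection matrices, shape (d_model, d_head)
    Returns (outputs, attention_weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]   # transpose K
    scores = matmul(q, k_t)                # (seq_len, seq_len) similarity scores
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_head) for s in row]
        if causal:
            # Hide future positions so token i only sees tokens 0..i.
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    # Each output row is a weighted average of the value vectors.
    return matmul(weights, v), weights

# Toy input: three 2-dimensional "token" vectors, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(x, eye, eye, eye)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

Real implementations run many heads in parallel on batched tensors and learn the projection matrices during training; the list-based version here only shows the data flow.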
The modern LLM lifecycle spans four major phases:
- Pretraining — next-token prediction on trillions of tokens to build a broad knowledge base
- Supervised Fine-Tuning (SFT) — instruction-following on curated prompt-response pairs
- Alignment — RLHF or DPO to align outputs with human preferences
- Deployment — quantization, serving engines, distributed inference, and scaling
Key Concepts at a Glance
| Concept | What It Is | Details |
|---|---|---|
| Transformer | Core neural network architecture | architecture#transformer-architecture |
| Self-Attention | Mechanism to weigh token relationships (QKV) | architecture#self-attention-mechanism |
| FFN / SwiGLU | Feed-forward network with gated activation | architecture#inside-a-transformer-block |
| RMSNorm / Pre-LN | Normalization and placement in transformer blocks | architecture#inside-a-transformer-block |
| MoE | Mixture of Experts — sparse activation | architecture#mixture-of-experts-moe |
| Tokenization | Breaking text into sub-word units (BPE) | architecture#tokenization-and-embeddings |
| RoPE / ALiBi | Positional encoding methods | architecture#positional-encoding-methods |
| FP16 / BF16 / FP8 | Floating-point precision formats | architecture#floating-point-bit-layout |
| Quantization | Reducing weight precision (FP16→INT4) | architecture#quantization-formats |
| GGUF / GPTQ / AWQ / EXL2 | Model file and quantization formats | architecture#model-formats-and-quantization-methods |
| MLX | Apple Silicon ML framework | architecture#mlx-apple-silicon |
| Training Pipeline | Pretraining → SFT → RLHF/DPO | architecture#training-pipeline |
| Distillation | Teacher-student model compression | architecture#knowledge-distillation |
| Scaling Laws | Training and inference-time compute scaling | architecture#scaling-laws |
| Model Merging | Combining fine-tuned models (TIES, DARE, SLERP) | architecture#model-merging |
| VRAM Estimation | Calculating GPU memory requirements | operations#vram-estimation |
| GPU Selection | Hardware selection guide | operations#gpu-hardware-selection-guide |
| LoRA / QLoRA | Parameter-efficient fine-tuning | operations#parameter-efficient-fine-tuning-peft |
| RAG | Retrieval-Augmented Generation | operations#retrieval-augmented-generation-rag |
| KV Cache | Stores attention keys and values of earlier tokens so they are not recomputed at each decoding step | operations#inference-optimization |
| vLLM / TensorRT-LLM | Production serving engines | operations#serving-engines |
| Benchmarks | MMLU, HumanEval, Arena Elo, etc. | operations#evaluation-benchmarks |
| Structured Output | Constrained decoding, JSON mode | operations#structured-output-and-constrained-decoding |
| Safety / Guardrails | Content filtering, prompt injection defense | operations#safety-guardrails-and-content-filtering |
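To make the "Quantization" row concrete, here is a minimal sketch of symmetric round-to-nearest quantization with a single absmax scale. It is an assumption-laden illustration only: the function names `quantize_absmax` and `dequantize` are hypothetical, and this is not the algorithm used by GPTQ, AWQ, EXL2, or the GGUF quantization types, which use per-group scales and more elaborate calibration.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid division by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]

# Toy example: four weights quantized to 4 bits and restored.
weights = [0.7, -0.28, 0.1, 0.0]
q, scale = quantize_absmax(weights, bits=4)
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(q, scale, max(errors))
```

The rounding error per weight is bounded by half the scale, which is why a single outlier weight (a large `absmax`) degrades precision for every other weight sharing that scale.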
Evaluation
| Dimension | Rating | Notes |
|---|---|---|
| Maturity | High | Transformer architecture has been battle-tested since 2017; MoE variants common in frontier models since 2024 |
| Ecosystem | Massive | Hugging Face, llama.cpp, vLLM, Ollama, MLX, NVIDIA TensorRT-LLM |
| Accessibility | Improving | QLoRA enables fine-tuning of 65B-parameter models on a single 48 GB GPU |
| Local Inference | Strong | GGUF + llama.cpp / MLX run 7B to 30B models on consumer hardware |
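The memory figures in the table follow from simple arithmetic on parameter count and bits per weight. The sketch below is a rough estimate under stated assumptions: the helper name `weight_memory_gb` is hypothetical, it counts model weights only (no KV cache, activations, adapter or optimizer state), and it treats 1 GB as 10^9 bytes. Real 4-bit formats also store scales and other metadata, so their effective bits per weight is somewhat above 4.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

# Compare 16-bit and 4-bit weights for the model sizes named in the table.
for size in (7, 30, 65):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B params: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```

Under these assumptions a 65B model needs about 32.5 GB for 4-bit weights versus 130 GB at 16-bit, which is why 4-bit base weights leave headroom on a 48 GB GPU while 16-bit weights do not fit at all.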
Sources
- Attention Is All You Need (Vaswani et al., 2017) — the original transformer paper
- Large Language Model — Wikipedia
- Transformer (deep learning) — Wikipedia
- How Transformers Work — DataCamp
- What Are LLMs — IBM
- Transformer Explainer — Georgia Tech
- How Do Transformers Work — Hugging Face
- MoE LLMs — Cameron R. Wolfe
- DeepSeekMoE Paper
- MoE Infrastructure — Introl
- MoE Explained — LocalAIMaster
- MoE Powers Frontier Models — NVIDIA
- Comprehensive GGUF Analysis — Furkan Gozukara
- AI Quantization Guide 2025 — Local AI Zone
- LLM Quantization: BF16 vs FP8 vs INT4 — AIMultiple
- GGUF Q4 Q8 FP16 Guide — D-Central
- Picking the Right Size Brain — InstaSD
- LLM Quantization Explained 2026 — VRLA Tech
- Quantization Methods Compared — ai.rs
- Quantization Formats — CraftRigs
- LLM Quantization Guide 2026 — Prem AI
- MLX Framework — Apple
- Exploring LLMs with MLX on M5 — Apple ML Research
- MLX GitHub Repository
- Knowledge Distillation — IBM
- Student-Teacher Distillation Guide — DEV Community
- Knowledge Distillation for LLMs — Newline
- DPO — Cameron R. Wolfe
- RLHF Explained — IntuitionLabs
- LLM Training Methodologies 2025 — Klizos
- SFT Guide — Thunder Compute
- LoRA and QLoRA — Analytics Vidhya
- Fine-Tuning Infrastructure — Introl
- Efficient Fine-Tuning with LoRA — Databricks
- PEFT — Hugging Face GitHub
- vLLM Production Deployment — Introl
- vLLM Deep Dive — martinuke0
- LLM Inference Optimization — Clarifai
- Mastering LLM Inference Optimization — NVIDIA
- KV Cache Management Survey
- Optimizing Inference — Hugging Face
- Context Window & Token Guide — QubitTool
- LLM Tokenizers — DigitalOcean
- LLM Serving Guide — Inference.net
Architecture Internals
- Transformer Design Guide Part 2 — Rohit Bandaru
- LLaMA Components: RMSNorm, SwiGLU, RoPE — Michael Brenndoerfer
- Advanced Transformer Architectures — Nebius Academy
- Pre-LN vs Post-LN — APXML
- Positional Embeddings: RoPE & ALiBi — Towards Data Science
- ALiBi Deep Dive: Interpolation vs Extrapolation — SambaNova
- FP16 vs BF16 Explained — Furkan Gozukara
- BF16 vs FP16 Key Differences — Bitfern
Model Merging
- Model Merging for LLMs — NVIDIA
- Merge LLMs with MergeKit — Hugging Face
- Model Merging Survey — Cameron R. Wolfe
VRAM & GPU
- Calculating GPU Memory for LLMs — BentoML
- How Much VRAM for Inference — Modal
- How Much VRAM for Fine-Tuning — Modal
RAG
Benchmarks
- LLM Benchmarks Compared — LXT
- AI Benchmarks Guide — Analytics Vidhya
- 30 LLM Evaluation Benchmarks — Evidently AI
Structured Output
Safety & Guardrails
- NeMo Guardrails — NVIDIA GitHub
- LLM Security & Guardrails — Langfuse
- Bypassing Guardrails (2025 Research) — arXiv
Questions
- How will post-transformer architectures like Mamba (state-space models) reshape the LLM landscape?
- What is the practical floor for quantization before quality degrades unacceptably for agentic/tool-use workflows?
- Will disaggregated prefill/decode (NVIDIA Dynamo, llm-d) become the default serving pattern?
- How will Apple's MLX ecosystem evolve with M5/M6 unified memory scaling?