# LLM Fundamentals
A comprehensive reference covering how Large Language Models work — from transformer internals and training pipelines to quantization formats, model serving, and production deployment.
## Summary
Large Language Models (LLMs) are neural networks trained on internet-scale text data to predict the next token in a sequence. Built on the transformer architecture (Vaswani et al., 2017), they use self-attention to weigh relationships between all tokens in a sequence in parallel. At the scale of billions of parameters and trillions of training tokens, emergent capabilities such as reasoning, code generation, and multi-step planning appear.
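The core of self-attention can be sketched in a few lines of NumPy. This is an illustrative single-head version (no masking, batching, or multi-head splitting); the function and variable names are our own, not from any particular library:

```python
# Minimal sketch of scaled dot-product self-attention (Vaswani et al., 2017).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_*: (d_model, d_head) learned projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # every token scores every other token
    weights = softmax(scores, axis=-1)        # each row is a distribution over tokens
    return weights @ v                        # weighted mix of value vectors

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 4
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
print(out.shape)  # prints (4, 8)
```

Because `scores` is computed for every token pair at once, attention processes the whole sequence in parallel — the property that makes transformers trainable at scale.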
The modern LLM lifecycle spans four major phases:
- Pretraining — next-token prediction on trillions of tokens to build a broad knowledge base
- Supervised Fine-Tuning (SFT) — instruction-following on curated prompt-response pairs
- Alignment — RLHF or DPO to align outputs with human preferences
- Deployment — quantization, serving engines, distributed inference, and scaling
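The pretraining phase above reduces to a single objective: cross-entropy on the next token. A hedged NumPy sketch, where the logits are random stand-ins for a transformer's output, purely to show the shift-by-one target alignment:

```python
# Sketch of the next-token prediction loss used in pretraining.
import numpy as np

def next_token_loss(logits, tokens):
    """logits: (seq_len, vocab); tokens: (seq_len,) token ids.
    Position t's logits are scored against the token at position t+1."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = tokens[1:]                                   # shifted by one position
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab, seq_len = 100, 8
loss = next_token_loss(rng.normal(size=(seq_len, vocab)),
                       rng.integers(0, vocab, size=seq_len))
print(loss)  # scalar average negative log-likelihood
```

SFT uses the same loss restricted to response tokens; RLHF and DPO replace it with preference-based objectives.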
## Key Concepts at a Glance
| Concept | What It Is | Details |
|---|---|---|
| Transformer | Core neural network architecture | architecture#transformer-architecture |
| Self-Attention | Mechanism to weigh token relationships | architecture#self-attention-mechanism |
| MoE | Mixture of Experts — sparse activation | architecture#mixture-of-experts-moe |
| Tokenization | Breaking text into sub-word units (BPE) | architecture#tokenization-and-embeddings |
| Quantization | Reducing weight precision (FP16→INT4) | architecture#quantization-formats |
| GGUF / GPTQ / AWQ | Model file and quantization formats | architecture#model-formats-and-quantization-methods |
| MLX | Apple Silicon ML framework | architecture#mlx-apple-silicon |
| Distillation | Teacher-student model compression | architecture#knowledge-distillation |
| LoRA / QLoRA | Parameter-efficient fine-tuning | operations#parameter-efficient-fine-tuning-peft |
| KV Cache | Key-Value cache for inference speedup | operations#inference-optimization |
| vLLM / TensorRT | Production serving engines | operations#serving-engines |
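The quantization row (FP16→INT4) can be illustrated with a symmetric per-tensor round-trip. Real formats such as GGUF, GPTQ, and AWQ use per-block scales, calibration data, or activation-aware weighting; this sketch shows only the basic scale/round/dequantize idea:

```python
# Illustrative symmetric INT4 quantization round-trip (not a real GGUF/GPTQ codec).
import numpy as np

def quantize_int4(w):
    """Map float weights onto the signed INT4 range [-8, 7] with one scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(w - dequantize(q, s)).mean()
print(q.dtype, err)  # int8 storage here; real formats pack two INT4 values per byte
```

The trade-off the table alludes to is exactly this `err`: 4-bit storage quarters memory versus FP16 at the cost of bounded rounding error per weight.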
## Evaluation
| Dimension | Rating | Notes |
|---|---|---|
| Maturity | High | Transformer architecture is battle-tested since 2017; MoE dominant since 2024 |
| Ecosystem | Massive | Hugging Face, llama.cpp, vLLM, Ollama, MLX, NVIDIA TensorRT |
| Accessibility | Improving | QLoRA enables fine-tuning 65B models on a single 48GB GPU |
| Local Inference | Strong | GGUF + llama.cpp / MLX run 7B-30B models on consumer hardware |
## Sources
- Attention Is All You Need (Vaswani et al., 2017) — the original transformer paper
- Large Language Model — Wikipedia
- Transformer (deep learning) — Wikipedia
- How Transformers Work — DataCamp
- What Are LLMs — IBM
- Transformer Explainer — Georgia Tech
- How Do Transformers Work — Hugging Face
- MoE LLMs — Cameron R. Wolfe
- DeepSeekMoE Paper
- MoE Infrastructure — Introl
- MoE Explained — LocalAIMaster
- MoE Powers Frontier Models — NVIDIA
- Comprehensive GGUF Analysis — Furkan Gozukara
- AI Quantization Guide 2025 — Local AI Zone
- LLM Quantization: BF16 vs FP8 vs INT4 — AIMultiple
- GGUF Q4 Q8 FP16 Guide — D-Central
- Picking the Right Size Brain — InstaSD
- LLM Quantization Explained 2026 — VRLA Tech
- Quantization Methods Compared — ai.rs
- Quantization Formats — CraftRigs
- LLM Quantization Guide 2026 — Prem AI
- MLX Framework — Apple
- Exploring LLMs with MLX on M5 — Apple ML Research
- MLX GitHub Repository
- Knowledge Distillation — IBM
- Student-Teacher Distillation Guide — DEV Community
- Knowledge Distillation for LLMs — Newline
- DPO — Cameron R. Wolfe
- RLHF Explained — IntuitionLabs
- LLM Training Methodologies 2025 — Klizos
- SFT Guide — Thunder Compute
- LoRA and QLoRA — Analytics Vidhya
- Fine-Tuning Infrastructure — Introl
- Efficient Fine-Tuning with LoRA — Databricks
- PEFT — Hugging Face GitHub
- vLLM Production Deployment — Introl
- vLLM Deep Dive — martinuke0
- LLM Inference Optimization — Clarifai
- Mastering LLM Inference Optimization — NVIDIA
- KV Cache Management Survey
- Optimizing Inference — Hugging Face
- Context Window & Token Guide — QubitTool
- LLM Tokenizers — DigitalOcean
- LLM Serving Guide — Inference.net
## Questions
- How will post-transformer architectures like Mamba (state-space models) reshape the LLM landscape?
- What is the practical floor for quantization before quality degrades unacceptably for agentic/tool-use workflows?
- Will disaggregated prefill/decode (NVIDIA Dynamo, llm-d) become the default serving pattern?
- How will Apple's MLX ecosystem evolve with M5/M6 unified memory scaling?