LLM Fundamentals

A comprehensive reference covering how Large Language Models work — from transformer internals and training pipelines to quantization formats, model serving, and production deployment.

Summary

Large Language Models (LLMs) are neural networks trained on internet-scale text to predict the next token in a sequence. Built on the transformer architecture (Vaswani et al., 2017), they use self-attention to weigh the relationships between all tokens in a sequence in parallel rather than one step at a time. When scaled to billions of parameters and trillions of training tokens, capabilities such as reasoning, code generation, and multi-step planning emerge.
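As a rough illustration of the self-attention step described above, here is a minimal single-head sketch in NumPy. The function names, the tiny dimensions, and the random weights are assumptions made for the example; real models use many heads, learned weights, and fused GPU kernels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, causal=True):
    """Single-head scaled dot-product attention (illustrative only)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to Q, K, V
    scores = q @ k.T / np.sqrt(k.shape[-1])      # pairwise token similarities
    if causal:
        # Hide future positions so each token attends only to itself
        # and to earlier tokens, as in next-token prediction.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

# Hypothetical toy sizes and random weights, chosen only for the demo.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

All rows of `weights` are computed in one matrix product, which is what lets the whole sequence be processed at once.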

The modern LLM lifecycle spans four major phases:

Pretraining — next-token prediction on trillions of tokens to build a broad knowledge base
Supervised Fine-Tuning (SFT) — instruction-following on curated prompt-response pairs
Alignment — RLHF or DPO to align outputs with human preferences
Deployment — quantization, serving engines, distributed inference, and scaling

Key Concepts at a Glance

Concept	What It Is	Details
Transformer	Core neural network architecture	architecture#transformer-architecture
Self-Attention	Mechanism to weigh token relationships (QKV)	architecture#self-attention-mechanism
FFN / SwiGLU	Feed-forward network with gated activation	architecture#inside-a-transformer-block
RMSNorm / Pre-LN	Normalization and placement in transformer blocks	architecture#inside-a-transformer-block
MoE	Mixture of Experts — sparse activation	architecture#mixture-of-experts-moe
Tokenization	Breaking text into sub-word units (BPE)	architecture#tokenization-and-embeddings
RoPE / ALiBi	Positional encoding methods	architecture#positional-encoding-methods
FP16 / BF16 / FP8	Floating-point precision formats	architecture#floating-point-bit-layout
Quantization	Reducing weight precision (FP16→INT4)	architecture#quantization-formats
GGUF / GPTQ / AWQ / EXL2	Model file and quantization formats	architecture#model-formats-and-quantization-methods
MLX	Apple Silicon ML framework	architecture#mlx-apple-silicon
Training Pipeline	Pretraining → SFT → RLHF/DPO	architecture#training-pipeline
Distillation	Teacher-student model compression	architecture#knowledge-distillation
Scaling Laws	Training and inference-time compute scaling	architecture#scaling-laws
Model Merging	Combining fine-tuned models (TIES, DARE, SLERP)	architecture#model-merging
VRAM Estimation	Calculating GPU memory requirements	operations#vram-estimation
GPU Selection	Hardware selection guide	operations#gpu-hardware-selection-guide
LoRA / QLoRA	Parameter-efficient fine-tuning	operations#parameter-efficient-fine-tuning-peft
RAG	Retrieval-Augmented Generation	operations#retrieval-augmented-generation-rag
KV Cache	Key-Value cache for inference speedup	operations#inference-optimization
vLLM / TensorRT-LLM	Production serving engines	operations#serving-engines
Benchmarks	MMLU, HumanEval, Arena Elo, etc.	operations#evaluation-benchmarks
Structured Output	Constrained decoding, JSON mode	operations#structured-output-and-constrained-decoding
Safety / Guardrails	Content filtering, prompt injection defense	operations#safety-guardrails-and-content-filtering
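To make the Quantization row above more concrete, the following sketch round-trips a weight matrix through a symmetric 4-bit integer grid with one scale per group. This is a deliberately simplified absmax scheme, not the algorithm used by GGUF, GPTQ, AWQ, or EXL2; the function names and the group size of 32 are assumptions for illustration.

```python
import numpy as np

def quantize_int4(weights, group_size=32):
    """Symmetric absmax quantization to the INT4 range [-8, 7] (sketch)."""
    groups = weights.reshape(-1, group_size)
    # One scale per group maps the largest magnitude onto the value 7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    # Values are held in int8 here; real formats pack two 4-bit values per byte.
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales, shape):
    # Recover approximate floating-point weights from integers and scales.
    return (q.astype(np.float32) * scales).reshape(shape)

# Hypothetical weight matrix, standing in for one layer of a model.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)

q, scales = quantize_int4(w)
w_hat = dequantize(q, scales, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max reconstruction error: {max_err:.6f}")
```

The rounding error per weight is bounded by half of its group's scale, which is why smaller groups tend to preserve quality better at the cost of storing more scales.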

Evaluation

Dimension	Rating	Notes
Maturity	High	Transformer architecture has been battle-tested since 2017; MoE designs have been widely adopted in frontier models since 2024
Ecosystem	Massive	Hugging Face, llama.cpp, vLLM, Ollama, MLX, NVIDIA TensorRT-LLM
Accessibility	Improving	QLoRA enables fine-tuning 65B models on a single 48GB GPU
Local Inference	Strong	GGUF + llama.cpp / MLX run 7B-30B models on consumer hardware
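The accessibility and local-inference ratings above rest on simple arithmetic: weight memory is roughly parameter count times bits per weight. The helper below is a back-of-the-envelope sketch with a hypothetical name; it counts weights only and ignores the KV cache, activations, optimizer state, and quantization metadata, so real requirements are higher.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Weights only; runtime overhead is deliberately excluded (assumption).
for params, bits in [(7, 16), (7, 4), (30, 4), (65, 4)]:
    print(f"{params}B @ {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions a 65B model at 4 bits needs about 32.5 GB for weights, which leaves headroom on a 48GB GPU, while a 7B model drops from 14 GB at 16 bits to 3.5 GB at 4 bits.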

Sources

Alisa's Math Notes (Notion) — probability, statistics, combinatorics, and MLE reference for ML practitioners. See ref - alisa-math-notes
Alisa's Book of LLMs (Notion) — derivation-heavy transformer internals, post-training (PPO/RLHF/GRPO/DPO), parallelism, and multimodality. See ref - alisa-book-of-llms
Attention Is All You Need (Vaswani et al., 2017) — the original transformer paper
Large Language Model — Wikipedia
Transformer (deep learning architecture) — Wikipedia
How Transformers Work — DataCamp
What Are LLMs — IBM
Transformer Explainer — Georgia Tech
How Do Transformers Work — Hugging Face
MoE LLMs — Cameron R. Wolfe
DeepSeekMoE Paper
MoE Infrastructure — Introl
MoE Explained — LocalAIMaster
MoE Powers Frontier Models — NVIDIA
Comprehensive GGUF Analysis — Furkan Gozukara
AI Quantization Guide 2025 — Local AI Zone
LLM Quantization: BF16 vs FP8 vs INT4 — AIMultiple
GGUF Q4 Q8 FP16 Guide — D-Central
Picking the Right Size Brain — InstaSD
LLM Quantization Explained 2026 — VRLA Tech
Quantization Methods Compared — ai.rs
Quantization Formats — CraftRigs
Quantization Overview — Hugging Face Transformers docs — GGUF/AWQ/GPTQ/bitsandbytes methods compared
MLX Framework — Apple
Exploring LLMs with MLX on M5 — Apple ML Research
MLX GitHub Repository
Knowledge Distillation — IBM
Student-Teacher Distillation Guide — DEV Community
Knowledge Distillation for LLMs — Newline
DPO — Cameron R. Wolfe
RLHF Explained — IntuitionLabs
LLM Training Methodologies 2025 — Klizos
SFT Guide — Thunder Compute
LoRA and QLoRA — Analytics Vidhya
Fine-Tuning Infrastructure — Introl
Efficient Fine-Tuning with LoRA — Databricks
PEFT — Hugging Face GitHub
vLLM Production Deployment — Introl
vLLM Deep Dive — martinuke0
LLM Inference Optimization — Clarifai
Mastering LLM Inference Optimization — NVIDIA
KV Cache Management Survey
Optimizing Inference — Hugging Face
Context Window & Token Guide — QubitTool
LLM Tokenizers — DigitalOcean
LLM Serving Guide — Inference.net

Architecture Internals

Model Merging

VRAM & GPU

RAG

Benchmarks

Structured Output

Safety & Guardrails

Questions

How will post-transformer architectures like Mamba (state-space models) reshape the LLM landscape?
What is the practical floor for quantization before quality degrades unacceptably for agentic/tool-use workflows?
Will disaggregated prefill/decode (NVIDIA Dynamo, llm-d) become the default serving pattern?
How will Apple's MLX ecosystem evolve with M5/M6 unified memory scaling?