LLM Fundamentals

A comprehensive reference covering how Large Language Models work — from transformer internals and training pipelines to quantization formats, model serving, and production deployment.

Summary

Large Language Models (LLMs) are neural networks trained on internet-scale text data to predict the next token in a sequence. Built on the transformer architecture (Vaswani et al., 2017), they use self-attention mechanisms to process relationships between all elements in a sequence simultaneously. When scaled to billions of parameters and trillions of training tokens, emergent capabilities like reasoning, code generation, and multi-step planning appear.
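The self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random projection matrices and no causal masking, not a faithful transformer layer:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # each output mixes all value vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                      # 4 tokens, embedding dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because every token attends to every other token, the score matrix is quadratic in sequence length, which is what makes long-context inference expensive.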

The modern LLM lifecycle spans four major phases:

  1. Pretraining — next-token prediction on trillions of tokens to build a broad knowledge base
  2. Supervised Fine-Tuning (SFT) — instruction-following on curated prompt-response pairs
  3. Alignment — RLHF or DPO to align outputs with human preferences
  4. Deployment — quantization, serving engines, distributed inference, and scaling
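The pretraining objective in phase 1 is just cross-entropy on the next token. A toy illustration with a hypothetical four-word vocabulary and made-up model probabilities:

```python
import math

# Hypothetical model outputs: one probability distribution over the vocab
# per position, plus the token that actually came next at each position.
predictions = [
    {"the": 0.1, "cat": 0.7, "sat": 0.1, "mat": 0.1},  # after "the"
    {"the": 0.1, "cat": 0.1, "sat": 0.6, "mat": 0.2},  # after "the cat"
]
targets = ["cat", "sat"]  # ground-truth next tokens

# Pretraining loss: average negative log-likelihood of the true next token.
loss = -sum(math.log(p[t]) for p, t in zip(predictions, targets)) / len(targets)
print(round(loss, 3))  # 0.434
```

Pretraining runs this objective over trillions of tokens; SFT reuses the same loss but on curated prompt-response pairs, typically masking the prompt tokens.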

Key Concepts at a Glance

| Concept | What It Is | Details |
| --- | --- | --- |
| Transformer | Core neural network architecture | architecture#transformer-architecture |
| Self-Attention | Mechanism to weigh token relationships | architecture#self-attention-mechanism |
| MoE | Mixture of Experts — sparse activation | architecture#mixture-of-experts-moe |
| Tokenization | Breaking text into sub-word units (BPE) | architecture#tokenization-and-embeddings |
| Quantization | Reducing weight precision (FP16 → INT4) | architecture#quantization-formats |
| GGUF / GPTQ / AWQ | Model file and quantization formats | architecture#model-formats-and-quantization-methods |
| MLX | Apple Silicon ML framework | architecture#mlx-apple-silicon |
| Distillation | Teacher-student model compression | architecture#knowledge-distillation |
| LoRA / QLoRA | Parameter-efficient fine-tuning | operations#parameter-efficient-fine-tuning-peft |
| KV Cache | Key-Value cache for inference speedup | operations#inference-optimization |
| vLLM / TensorRT | Production serving engines | operations#serving-engines |
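The quantization row above (FP16 → INT4) can be illustrated with a minimal symmetric round-to-nearest sketch. Real methods like GPTQ and AWQ are considerably more involved (per-group scales, calibration data, error compensation); this only shows the core idea of trading precision for memory:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest quantization to 4-bit codes (-8..7),
    returning the integer codes and a scale factor for dequantization."""
    scale = np.abs(w).max() / 7.0                    # map the largest weight to 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)       # toy weight tensor
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()
print(err <= scale / 2 + 1e-6)  # True: reconstruction error is at most half a step
```

Storing 4-bit codes plus one scale per tensor (or, in practice, per small group of weights) is what lets 7B-30B models fit in consumer-grade memory.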

Evaluation

| Dimension | Rating | Notes |
| --- | --- | --- |
| Maturity | High | Transformer architecture has been battle-tested since 2017; MoE dominant since 2024 |
| Ecosystem | Massive | Hugging Face, llama.cpp, vLLM, Ollama, MLX, NVIDIA TensorRT |
| Accessibility | Improving | QLoRA enables fine-tuning 65B models on a single 48GB GPU |
| Local Inference | Strong | GGUF + llama.cpp / MLX run 7B-30B models on consumer hardware |

Questions

  • How will post-transformer architectures like Mamba (state-space models) reshape the LLM landscape?
  • What is the practical floor for quantization before quality degrades unacceptably for agentic/tool-use workflows?
  • Will disaggregated prefill/decode (NVIDIA Dynamo, llm-d) become the default serving pattern?
  • How will Apple's MLX ecosystem evolve with M5/M6 unified memory scaling?