Architecture¶
Kubernetes follows a declarative, state-driven architecture. A cluster consists of a control plane (one or more machines running the management components) and worker nodes (machines running user workloads). All state is stored in etcd, and every component communicates through the API server.
See also: infrastructure/kubernetes/index, infrastructure/kubernetes/architecture, infrastructure/kubernetes/operations, infrastructure/kubernetes/security
Cluster Architecture Overview¶
```mermaid
graph TD
    subgraph ControlPlane["Control Plane"]
        API["kube-apiserver<br/>:6443"]
        ETCD["etcd<br/>:2379-2380"]
        SCHED["kube-scheduler<br/>:10259"]
        CM["kube-controller-manager<br/>:10257"]
        DNS["CoreDNS"]
    end
    subgraph Node1["Worker Node 1"]
        KUBELET1["kubelet<br/>:10250"]
        PROXY1["kube-proxy"]
        CRI1["CRI Runtime<br/>(containerd/CRI-O)"]
        POD1A["Pod A"]
        POD1B["Pod B"]
    end
    subgraph Node2["Worker Node 2"]
        KUBELET2["kubelet<br/>:10250"]
        PROXY2["kube-proxy"]
        CRI2["CRI Runtime<br/>(containerd/CRI-O)"]
        POD2A["Pod C"]
        POD2B["Pod D"]
    end
    API <--> ETCD
    SCHED --> API
    CM --> API
    DNS --> API
    KUBELET1 --> API
    PROXY1 --> API
    KUBELET2 --> API
    PROXY2 --> API
    KUBELET1 --> CRI1
    CRI1 --> POD1A
    CRI1 --> POD1B
    KUBELET2 --> CRI2
    CRI2 --> POD2A
    CRI2 --> POD2B
```
Control Plane Components¶
kube-apiserver¶
The API server is the front end for the Kubernetes control plane. It exposes the Kubernetes HTTP API on port 6443 and is the only component that communicates directly with etcd.
- Validates and stores all API objects (pods, services, deployments, etc.)
- Serves the REST API consumed by kubectl, controllers, and external tools
- Handles authentication, authorization (for example RBAC), and admission control
- Scales horizontally by running multiple replicas behind a load balancer
- All other control plane components watch the API server, not etcd directly
etcd¶
etcd is a distributed, consistent key-value store that holds all cluster state:
- Stores every object definition, cluster config, and dynamic state
- Uses the Raft consensus algorithm for HA (typically 3 or 5 members, so that a majority quorum survives member failures)
- Listens on TCP 2379 (client) and 2380 (peer communication)
- Only the API server communicates with etcd directly
- Backups are critical; run `etcdctl snapshot save` regularly
etcd Quorum
An etcd cluster requires a majority (quorum) to accept writes. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Never run production etcd with fewer than 3 members.
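The quorum arithmetic above can be checked with a few lines of Python. This is only a sketch of the majority rule, not etcd code; the function names are made up for illustration.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum_size(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum_size(n)}, tolerates={failures_tolerated(n)}")
```

Note that 4 members tolerate the same single failure as 3, which is why odd cluster sizes are the usual choice.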
kube-scheduler¶
The scheduler watches for newly created pods that have no assigned node and selects an appropriate node for them:
- Filtering: Eliminates nodes that do not meet requirements (resources, taints, node selectors)
- Scoring: Ranks remaining nodes using priority functions (resource balance, affinity, anti-affinity)
- Binding: Writes the selected node name to the pod object via the API server
- Supports custom schedulers and scheduler extensions
- Listens on port 10259 for health probes
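The filter, score, and bind steps can be illustrated with a toy model. The sketch below is not the kube-scheduler implementation: the `Node` and `Pod` classes, the taint handling, and the scoring heuristic (most free resources left) are simplifying assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int            # free CPU in millicores
    free_mem_mi: int           # free memory in MiB
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    tolerations: set = field(default_factory=set)


def filter_nodes(pod, nodes):
    """Filtering: keep nodes with enough free resources and no untolerated taint."""
    return [
        n for n in nodes
        if n.free_cpu_m >= pod.cpu_m
        and n.free_mem_mi >= pod.mem_mi
        and n.taints <= pod.tolerations
    ]


def score_node(pod, node):
    """Scoring: toy heuristic that prefers the node with the most headroom left."""
    return (node.free_cpu_m - pod.cpu_m) + (node.free_mem_mi - pod.mem_mi)


def schedule(pod, nodes):
    """Binding: return the chosen node name, or None if no node passes filtering."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score_node(pod, n)).name


nodes = [
    Node("node-a", free_cpu_m=500, free_mem_mi=1024),
    Node("node-b", free_cpu_m=2000, free_mem_mi=4096),
    Node("node-c", free_cpu_m=4000, free_mem_mi=8192, taints={"gpu"}),
]
pod = Pod("web", cpu_m=1000, mem_mi=512)
print(schedule(pod, nodes))  # node-b
```

`node-a` is filtered out for lack of CPU and `node-c` for its untolerated taint, so `node-b` is the only candidate left to score.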
kube-controller-manager¶
Runs the core reconciliation controllers in a single binary:
- Node Controller: Monitors node health, responds to node failures
- ReplicaSet Controller: Maintains desired replica count for pods
- Deployment Controller: Manages rolling updates and rollbacks
- StatefulSet Controller: Ordered creation/deletion of stateful pods
- Job Controller: Manages batch job completion
- EndpointSlice Controller: Populates service endpoint data
- Service Account Controller: Creates default service accounts for new namespaces
- Namespace Controller: Cleans up resources when namespaces are deleted
Each controller is a reconciliation loop that watches the API server for changes and drives current state toward desired state.
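A reconciliation loop of this kind can be sketched in a few lines. The example below is a simplified stand-in for a ReplicaSet-style controller: the `reconcile` function, the action tuples, and the pod naming are hypothetical, and a real controller reacts to watch events instead of looping a fixed number of times.

```python
import itertools


def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move the observed state to the desired one."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Delete the surplus; this sketch simply picks the last pods in the list.
        return [("delete", name) for name in running_pods[diff:]]
    return []


pods = ["web-1"]
ids = itertools.count(2)
for _ in range(3):  # a real controller keeps running on watch events
    for verb, name in reconcile(3, pods):
        if verb == "create":
            pods.append(f"web-{next(ids)}")
        else:
            pods.remove(name)
print(pods)  # ['web-1', 'web-2', 'web-3']
```

After the first pass the observed state matches the desired state, so later passes produce no actions. That idempotence is the point of the pattern.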
CoreDNS¶
CoreDNS provides cluster-wide DNS resolution for services and pods:
- Resolves service names to cluster IPs (e.g., `my-service.my-namespace.svc.cluster.local`)
- Supports DNS-based service discovery for headless services
- Configurable via a Corefile (custom DNS zones, forwarding, rewrites)
- Deployed as a Deployment in the `kube-system` namespace
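The service naming scheme is regular enough to express as a small helper. The sketch below assumes the common `cluster.local` cluster domain and the usual pod DNS search list (own namespace, then `svc`, then the cluster domain); both are configurable in a real cluster, and the function names are invented for illustration.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def search_candidates(name, pod_namespace, cluster_domain="cluster.local"):
    """Names a resolver would try for a short name, given the assumed search list."""
    search = [
        f"{pod_namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    return [f"{name}.{suffix}" for suffix in search]


print(service_fqdn("my-service", "my-namespace"))
# my-service.my-namespace.svc.cluster.local
print(search_candidates("my-service", "my-namespace")[0])
# my-service.my-namespace.svc.cluster.local
```

This is why a pod can reach a Service in its own namespace by the short name alone, while a Service in another namespace needs at least `name.namespace`.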
Node Components¶
kubelet¶
The kubelet runs on every node and is responsible for pod lifecycle:
- Watches the API server for pods assigned to its node
- Pulls container images via the CRI runtime
- Reports node and pod status back to the API server
- Executes liveness, readiness, and startup probes
- Mounts volumes (CSI) and configures pod networking (CNI)
- Listens on port 10250 for API server health checks and exec/log requests
kube-proxy¶
kube-proxy maintains network rules on every node for service routing:
- Implements the Service abstraction by managing iptables or IPVS rules
- Routes traffic from service ClusterIP to backend pod IPs
- Supports iptables (default on Linux) and IPVS (high scale) modes; the legacy userspace mode was removed in Kubernetes v1.26
- Watches the API server for changes to Services and EndpointSlices
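Conceptually, the rules kube-proxy programs map a ClusterIP and port to one of several backend pods. The sketch below models that lookup in user space purely for illustration; in a real cluster the selection happens in the kernel through iptables or IPVS rules. The `SERVICES` table and the port numbers are hypothetical, while the IPs reuse the example addresses from the networking diagram on this page.

```python
import random

# Hypothetical service table, as it might be derived from Services and their endpoints.
SERVICES = {
    ("10.96.0.10", 80): [("10.244.1.2", 8080), ("10.244.2.2", 8080)],
}


def route(cluster_ip, port, rng=random):
    """Pick a backend (pod IP, port) for a connection to a ClusterIP, or None."""
    backends = SERVICES.get((cluster_ip, port))
    if not backends:
        return None  # unknown service, or no ready endpoints
    return rng.choice(backends)  # iptables mode also selects a backend at random


print(route("10.96.0.10", 80))
print(route("10.96.0.99", 80))  # None
```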
CRI Runtime (containerd / CRI-O)¶
The container runtime interface (CRI) decouples kubelet from specific runtimes:
- containerd: Most common runtime; CNCF graduated project
- CRI-O: Lightweight runtime optimized for Kubernetes
- Both implement the CRI API (gRPC) that kubelet calls
- Container image pulling, container creation, and execution are delegated here
- Dockershim was removed in Kubernetes v1.24
Container Network Interface (CNI)¶
CNI plugins handle pod network configuration:
- Each pod gets its own IP address
- Plugins: Calico, Cilium, Flannel, Weave, AWS VPC CNI, GKE Dataplane V2
- Responsible for pod-to-pod connectivity within and across nodes
- NetworkPolicy enforcement is plugin-dependent
Container Storage Interface (CSI)¶
CSI provides a standard interface for exposing storage to containers:
- Replaces in-tree volume plugins (core CSI migration reached GA in v1.25; individual plugins migrate on their own schedule)
- Third-party drivers: AWS EBS CSI, GCE PD CSI, Azure Disk CSI, Ceph CSI
- Handles volume provisioning, attachment, mounting, snapshots, and cloning
Container Runtime Interface (CRI)¶
The CRI is the gRPC API between kubelet and the container runtime:
- Defines services: `RuntimeService` (container lifecycle) and `ImageService` (image management)
- Enables pluggable runtimes without modifying kubelet source code
- Supports both container runtimes and sandbox runtimes (Kata, gVisor)
Request Flow: Pod Creation¶
```mermaid
sequenceDiagram
    actor User
    participant API as kube-apiserver
    participant etcd as etcd
    participant CM as controller-manager
    participant Sched as kube-scheduler
    participant Kubelet as kubelet
    participant CRI as CRI runtime
    User->>API: kubectl apply -f pod.yaml
    API->>API: authenticate + authorize (RBAC)
    API->>API: run admission controllers (mutating, validating)
    API->>etcd: persist pod spec (nodeName="")
    etcd-->>API: confirmed
    API-->>User: accepted
    CM->>API: watch for new pods
    Note over CM: ReplicaSet controller ensures desired count
    Sched->>API: watch for unassigned pods
    API->>Sched: notify pod with nodeName=""
    Sched->>Sched: filter + score nodes
    Sched->>API: bind pod to node (set nodeName)
    API->>etcd: persist binding
    Kubelet->>API: watch for pods on its node
    API->>Kubelet: pod assigned to this node
    Kubelet->>CRI: pull image + create container
    CRI-->>Kubelet: container started
    Kubelet->>API: update pod status (Running)
    API->>etcd: persist status
```
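The first part of this flow, from the incoming request to the write into etcd, can be modelled as a short pipeline. The sketch below is a toy, not the kube-apiserver: the token table, the `can_create` permission set, the two admission functions, and the string results are all assumptions made for illustration, and a plain dict stands in for etcd.

```python
def set_defaults(obj):
    """Mutating step: a new pod starts with no node assigned."""
    obj.setdefault("nodeName", "")
    return obj


def require_image(obj):
    """Validating step: reject pods without a container image."""
    if not obj.get("image"):
        raise ValueError("image is required")
    return obj


def handle_create(request, users, plugins, store):
    user = users.get(request["token"])           # 1. authenticate
    if user is None:
        return "unauthenticated"
    if request["kind"] not in user["can_create"]:  # 2. authorize
        return "forbidden"
    obj = dict(request["object"])                # copy so the caller's dict is untouched
    try:
        for plugin in plugins:                   # 3. admission (mutating, then validating)
            obj = plugin(obj)
    except ValueError:
        return "rejected"
    store[(request["kind"], obj["name"])] = obj  # 4. persist (stands in for etcd)
    return "created"


users = {"alice-token": {"can_create": {"Pod"}}}
store = {}
request = {"token": "alice-token", "kind": "Pod",
           "object": {"name": "web", "image": "nginx"}}
print(handle_create(request, users, [set_defaults, require_image], store))  # created
print(store[("Pod", "web")])
```

The stored object carries `nodeName: ""`, which is the condition the scheduler watches for in the next step of the diagram.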
Key Ports¶
| Component | Port | Protocol | Purpose |
|---|---|---|---|
| kube-apiserver | 6443 | TCP | Kubernetes API |
| etcd | 2379-2380 | TCP | Client and peer communication |
| kubelet | 10250 | TCP | API server health, exec, logs |
| kube-scheduler | 10259 | TCP | Health probes |
| kube-controller-manager | 10257 | TCP | Health probes |
| kube-proxy | 10256 | TCP | Health probes |
How It Works¶
Desired-state reconciliation, pod lifecycle, scheduling, networking, and storage internals.
Desired-State Reconciliation Loop¶
```mermaid
sequenceDiagram
    participant User as User / CI
    participant API as kube-apiserver
    participant ETCD as etcd
    participant Ctrl as Controller Manager
    participant Sched as Scheduler
    participant KL as kubelet (Node)
    participant CRI as containerd
    User->>API: kubectl apply -f deployment.yaml
    API->>ETCD: Store desired state
    Ctrl->>API: Watch: new Deployment
    Ctrl->>API: Create ReplicaSet
    Ctrl->>API: Create Pod specs
    Sched->>API: Watch: unscheduled Pods
    Sched->>Sched: Score nodes (resources, affinity, taints)
    Sched->>API: Bind Pod → Node
    KL->>API: Watch: Pod assigned to my node
    KL->>CRI: Create container sandbox
    CRI->>CRI: Pull image, start containers
    KL->>API: Update Pod status: Running
    loop Reconciliation
        Ctrl->>API: Watch: actual vs desired state
        Ctrl->>Ctrl: If replicas < desired → create more Pods
        Ctrl->>Ctrl: If replicas > desired → delete surplus
    end
```
Pod Lifecycle¶
```mermaid
stateDiagram-v2
    [*] --> Pending: Pod created
    Pending --> Running: Container started
    Running --> Succeeded: All containers exit 0
    Running --> Failed: Container exits non-zero
    Running --> Unknown: Node unreachable
    Failed --> [*]: Not restarted
    Succeeded --> [*]: Job complete
    Unknown --> Running: Node recovers
    Unknown --> Failed: Node dead (grace period)
    state Running {
        [*] --> Init: Init containers run sequentially
        Init --> Ready: Readiness probe passes
        Ready --> NotReady: Readiness probe fails
        NotReady --> Ready: Probe passes again
    }
```
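The phase transitions in the diagram can be written down as a small state machine. The sketch below encodes only the arrows drawn above; it is not the kubelet's logic, real pods have transitions the diagram leaves out, and the `advance` helper is a hypothetical name.

```python
# Allowed phase transitions, copied from the diagram above.
TRANSITIONS = {
    "Pending": {"Running"},
    "Running": {"Succeeded", "Failed", "Unknown"},
    "Unknown": {"Running", "Failed"},
    "Succeeded": set(),   # terminal
    "Failed": set(),      # terminal
}


def advance(phase, new_phase):
    """Return new_phase if the diagram allows the transition, else raise."""
    if new_phase not in TRANSITIONS[phase]:
        raise ValueError(f"illegal transition {phase} -> {new_phase}")
    return new_phase


phase = "Pending"
for nxt in ("Running", "Unknown", "Running", "Succeeded"):
    phase = advance(phase, nxt)
print(phase)  # Succeeded
```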
Networking Model¶
The 4 Networking Rules¶
- Pod-to-Pod: Every Pod gets its own IP. All Pods can communicate without NAT.
- Pod-to-Service: Services provide stable virtual IPs (ClusterIP) backed by iptables/IPVS rules.
- External-to-Service: LoadBalancer, NodePort, or Ingress/Gateway API expose services.
- Pod-to-External: Pods can reach external networks via SNAT.
```mermaid
flowchart TB
    subgraph Cluster["Kubernetes Cluster"]
        subgraph Node1["Node 1"]
            P1["Pod A\n10.244.1.2"]
            P2["Pod B\n10.244.1.3"]
            KP1["kube-proxy\n(iptables/IPVS)"]
        end
        subgraph Node2["Node 2"]
            P3["Pod C\n10.244.2.2"]
            P4["Pod D\n10.244.2.3"]
            KP2["kube-proxy"]
        end
        SVC["Service: my-svc\nClusterIP: 10.96.0.10\n→ Pod A, Pod C"]
        CNI["CNI Plugin\n(Calico/Cilium/Flannel)\nPod-to-Pod routing"]
    end
    External["External\nTraffic"] -->|"LoadBalancer /\nNodePort"| SVC
    SVC -->|"iptables rules"| P1
    SVC -->|"iptables rules"| P3
    P1 <-->|"CNI overlay"| P3
    P2 <-->|"CNI overlay"| P4
    style Cluster fill:#326ce5,color:#fff
```
Storage Architecture¶
```mermaid
flowchart LR
    Pod["Pod"] --> PVC["PersistentVolumeClaim\n(request: 10Gi)"]
    PVC --> PV["PersistentVolume\n(10Gi, RWO)"]
    PV --> SC["StorageClass\n(provisioner: ebs.csi)"]
    SC --> CSI["CSI Driver\n(EBS, GCE PD, Ceph)"]
    CSI --> Disk["Cloud Disk\nor Storage"]
    style PVC fill:#326ce5,color:#fff
```
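The claim-to-volume step in this chain amounts to a matching problem. The sketch below shows one plausible way to match a claim against pre-provisioned volumes; the `Volume` and `Claim` classes, the `ebs` class name, and the smallest-fit rule are assumptions for illustration, and dynamic provisioning through a CSI driver is only noted in a comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str
    claimed_by: Optional[str] = None


@dataclass
class Claim:
    name: str
    request_gi: int
    access_mode: str
    storage_class: str


def bind(claim, volumes):
    """Bind the claim to the smallest free volume that satisfies it, if any."""
    candidates = [
        v for v in volumes
        if v.claimed_by is None
        and v.storage_class == claim.storage_class
        and v.capacity_gi >= claim.request_gi
        and claim.access_mode in v.access_modes
    ]
    if not candidates:
        return None  # with dynamic provisioning, a CSI driver would create a volume here
    chosen = min(candidates, key=lambda v: v.capacity_gi)
    chosen.claimed_by = claim.name
    return chosen


volumes = [
    Volume("pv-small", 5, frozenset({"RWO"}), "ebs"),
    Volume("pv-large", 100, frozenset({"RWO"}), "ebs"),
    Volume("pv-fit", 10, frozenset({"RWO"}), "ebs"),
]
bound = bind(Claim("data", 10, "RWO", "ebs"), volumes)
print(bound.name)  # pv-fit
```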
Scheduling Algorithm¶
| Phase | Operation |
|---|---|
| Filtering | Eliminate nodes that don't meet Pod requirements (resources, taints, affinity) |
| Scoring | Rank remaining nodes with scoring plugins such as NodeResourcesFit (LeastAllocated), NodeResourcesBalancedAllocation, NodeAffinity, and PodTopologySpread |
| Binding | Assign Pod to highest-scoring node |
| Preemption | If no node fits, evict lower-priority Pods |
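The preemption row can be illustrated for a single node and a single resource. The sketch below is a heavy simplification of what the scheduler does: it considers CPU only, evicts lowest-priority pods first, and never evicts a pod whose priority is equal to or higher than the incoming pod's. The function name and tuple layout are hypothetical.

```python
def pick_victims(pod_cpu, pod_priority, node_free_cpu, running):
    """Choose pods to evict so that `pod_cpu` fits on the node.

    `running` is a list of (name, priority, cpu) tuples.
    Returns a list of victim names ([] if the pod already fits),
    or None if evicting every lower-priority pod is still not enough.
    """
    needed = pod_cpu - node_free_cpu
    if needed <= 0:
        return []
    victims = []
    for name, priority, cpu in sorted(running, key=lambda p: p[1]):
        if priority >= pod_priority:
            break  # only strictly lower-priority pods may be preempted
        victims.append(name)
        needed -= cpu
        if needed <= 0:
            return victims
    return None


running = [("db", 1000, 2000), ("batch-1", 0, 500), ("batch-2", 10, 500)]
print(pick_victims(1000, 100, 200, running))  # ['batch-1', 'batch-2']
print(pick_victims(1000, 5, 200, running))    # None
```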
Benchmarks¶
Scope
Kubernetes scalability limits, API server performance, etcd throughput, and scheduling benchmarks.
Official Scalability Targets (SIG Scalability)¶
Kubernetes officially tests and targets these limits per cluster:
| Dimension | Target | Notes |
|---|---|---|
| Nodes | 5,000 | Tested by SIG Scalability |
| Pods | 150,000 | 30 pods/node avg |
| Pods per node | 110 | Kubelet default maxPods |
| Services | 10,000 | |
| Endpoints per Service | 5,000 | Beyond this, use EndpointSlices |
| Namespaces | 10,000 | |
| ConfigMaps | 30,000 | |
| Secrets | 30,000 | |
| Total API objects | ~300,000 | etcd storage limit |
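The targets interact: the per-node maximum and the cluster-wide pod total cannot both be reached at 5,000 nodes. A short check makes that visible. The sketch below copies four of the figures from the table; the `check_plan` helper is a made-up name, and the check is plain arithmetic, not an official sizing tool.

```python
# Figures copied from the table above.
TARGETS = {
    "nodes": 5_000,
    "pods": 150_000,
    "pods_per_node": 110,
    "services": 10_000,
}


def check_plan(nodes, pods_per_node, services):
    """Return a list of dimensions where the plan exceeds the targets."""
    planned = {
        "nodes": nodes,
        "pods": nodes * pods_per_node,
        "pods_per_node": pods_per_node,
        "services": services,
    }
    return [
        f"{key}: {value} exceeds {TARGETS[key]}"
        for key, value in planned.items()
        if value > TARGETS[key]
    ]


print(check_plan(nodes=5_000, pods_per_node=30, services=2_000))
# []
print(check_plan(nodes=5_000, pods_per_node=110, services=2_000))
# ['pods: 550000 exceeds 150000']
```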
API Server Performance¶
| Metric | Target SLO | Notes |
|---|---|---|
| API request latency (mutating, P99) | < 1s | At 5000-node scale |
| API request latency (non-mutating, P99) | < 5s | For resource-list calls |
| API request latency (P50) | < 100ms | Typical operations |
| Pod startup latency (P99) | < 5s | Pod ready from API call |
etcd Performance¶
| Cluster Size | WAL fsync P99 | Read latency P99 | Write QPS | Storage |
|---|---|---|---|---|
| < 100 nodes | < 5ms | < 10ms | 1,000 | 2Gi |
| 100-500 nodes | < 10ms | < 25ms | 5,000 | 4Gi |
| 500-5000 nodes | < 10ms | < 50ms | 10,000 | 8Gi |
Disk Latency is Critical
etcd persists every write to its write-ahead log with an fsync before acknowledging it. Sustained fsync latency above 10ms can trigger leader elections, cluster instability, and cascading failures.
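A quick way to judge a disk against the 10ms guideline is to compute the P99 of measured fsync latencies. The sketch below uses a nearest-rank percentile over made-up sample data; in practice the samples would come from a disk benchmark or from etcd's `etcd_disk_wal_fsync_duration_seconds` metric, and the threshold constant simply restates the figure above.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) without floats
    return ordered[rank - 1]


FSYNC_LIMIT_MS = 10  # guideline from the text above

# Hypothetical measurements: mostly fast, with two slow outliers.
samples_ms = [2] * 98 + [12, 15]
p99 = percentile(samples_ms, 99)
print(p99, "too slow" if p99 > FSYNC_LIMIT_MS else "ok")  # 12 too slow
```

The median of the same data is 2ms, which shows why the tail, not the average, is the figure to watch.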
Scheduling Performance¶
| Scheduler Metric | Value | Conditions |
|---|---|---|
| Scheduling throughput | ~100 pods/sec | Default scheduler, 5000-node cluster |
| Scheduling latency (P99) | < 100ms | Without complex affinity rules |
| Scheduling with affinity | 20-50 pods/sec | Pod anti-affinity across nodes |
| Preemption overhead | +50-100ms | When preemption kicks in |
Network Performance (CNI Comparison)¶
| CNI | Pod-to-Pod Latency | Throughput (TCP) | Throughput (eBPF) | Encryption Overhead |
|---|---|---|---|---|
| Cilium | ~50us | 9.5 Gbps | 9.8 Gbps (native) | 15-20% (WireGuard) |
| Calico | ~60us | 9.2 Gbps | 9.5 Gbps (eBPF mode) | 20-25% (WireGuard) |
| Flannel (VXLAN) | ~80us | 8.5 Gbps | N/A | N/A (no native) |
| Host networking | ~30us | 10 Gbps | N/A | N/A |
Real-World Scale References¶
- Google GKE: Supports 15,000 nodes per cluster (managed)
- AWS EKS: Up to 5,000 nodes with managed control plane
- OpenAI: Runs 7,500-node clusters for ML training
- Alibaba Cloud: Reported testing at 10,000+ nodes
Sourcing Status¶
Unsourced Performance Data
The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.
Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.