Architecture¶

Kubernetes follows a declarative, state-driven architecture. A cluster consists of a control plane (one or more machines running the management components) and worker nodes (machines running user workloads). All state is stored in etcd, and every component communicates through the API server.

Cluster Architecture Overview¶

graph TD
    subgraph ControlPlane["Control Plane"]
        API["kube-apiserver<br/>:6443"]
        ETCD["etcd<br/>:2379-2380"]
        SCHED["kube-scheduler<br/>:10259"]
        CM["kube-controller-manager<br/>:10257"]
        DNS["CoreDNS"]
    end

    subgraph Node1["Worker Node 1"]
        KUBELET1["kubelet<br/>:10250"]
        PROXY1["kube-proxy"]
        CRI1["CRI Runtime<br/>(containerd/CRI-O)"]
        POD1A["Pod A"]
        POD1B["Pod B"]
    end

    subgraph Node2["Worker Node 2"]
        KUBELET2["kubelet<br/>:10250"]
        PROXY2["kube-proxy"]
        CRI2["CRI Runtime<br/>(containerd/CRI-O)"]
        POD2A["Pod C"]
        POD2B["Pod D"]
    end

    API <--> ETCD
    SCHED --> API
    CM --> API
    DNS --> API
    KUBELET1 --> API
    PROXY1 --> API
    KUBELET2 --> API
    PROXY2 --> API
    KUBELET1 --> CRI1
    CRI1 --> POD1A
    CRI1 --> POD1B
    KUBELET2 --> CRI2
    CRI2 --> POD2A
    CRI2 --> POD2B

Control Plane Components¶

kube-apiserver¶

The API server is the front end for the Kubernetes control plane. It exposes the Kubernetes HTTP API on port 6443 and is the only component that communicates directly with etcd.

Validates and stores all API objects (pods, services, deployments, etc.)
Serves the REST API consumed by kubectl, controllers, and external tools
Handles authentication, authorization (RBAC), and admission controllers
Scales horizontally by running multiple replicas behind a load balancer
All other control plane components watch the API server, not etcd directly
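
The order in which a request passes through these stages can be sketched as a small pipeline. This is a toy model only: `Request`, `handle`, the `KNOWN_USERS` and `RBAC` tables, and the in-memory `store` standing in for etcd are hypothetical names for illustration, not part of any Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str
    verb: str
    kind: str
    obj: dict = field(default_factory=dict)


# Hypothetical stand-ins for real authentication and RBAC data.
KNOWN_USERS = {"alice", "ci-bot"}
RBAC = {("alice", "create", "Pod"), ("ci-bot", "get", "Pod")}


def handle(req, store):
    """Run the stages in order: authn, authz, admission, then persistence."""
    if req.user not in KNOWN_USERS:                  # authentication
        return "401 Unauthorized"
    if (req.user, req.verb, req.kind) not in RBAC:   # authorization (RBAC)
        return "403 Forbidden"
    # Mutating admission: fill in a default before validation runs.
    req.obj.setdefault("namespace", "default")
    # Validating admission: reject objects that lack a name.
    if req.verb == "create" and not req.obj.get("name"):
        return "422 Unprocessable Entity"
    if req.verb == "create":                         # persist (etcd stand-in)
        key = (req.kind, req.obj["namespace"], req.obj["name"])
        store[key] = req.obj
        return "201 Created"
    return "200 OK"
```

Only a request that clears every earlier stage reaches the store, which mirrors why nothing is written to etcd for rejected requests.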

etcd¶

etcd is a distributed, consistent key-value store that holds all cluster state:

Stores every object definition, cluster config, and dynamic state
Uses the Raft consensus algorithm for HA (typically 3 or 5 members, giving a quorum of 2 or 3)
Listens on TCP 2379 (client) and 2380 (peer communication)
Only the API server communicates with etcd directly
Backups are critical; use etcdctl snapshot save regularly

etcd Quorum

An etcd cluster requires a majority (quorum) to accept writes. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Never run production etcd with fewer than 3 members.
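
The quorum arithmetic is easy to check directly. The sketch below assumes the usual Raft majority rule; the function names are illustrative.

```python
def quorum(members):
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def failures_tolerated(members):
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates={failures_tolerated(n)}")
```

Note that an even member count adds no fault tolerance over the next lower odd count (4 members tolerate 1 failure, the same as 3), which is why odd sizes are the norm.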

kube-scheduler¶

The scheduler watches for newly created pods that have no assigned node and selects an appropriate node for them:

Filtering: Eliminates nodes that do not meet requirements (resources, taints, node selectors)
Scoring: Ranks remaining nodes using priority functions (resource balance, affinity, anti-affinity)
Binding: Writes the selected node name to the pod object via the API server
Supports custom schedulers and scheduler extensions
Listens on port 10259 for health probes
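
The filter, score, and bind phases can be sketched in a few lines. This is a deliberately simplified model, not the scheduler's actual plugin framework: the `Node` and `Pod` classes are hypothetical, and the scoring rule (prefer the node with the most CPU left) is an assumption chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                      # free CPU in millicores
    free_mem_mi: int                     # free memory in MiB
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    tolerations: set = field(default_factory=set)
    node_name: str = ""                  # empty until bound


def feasible(pod, node):
    """Filtering: enough resources, and every taint is tolerated."""
    return (node.free_cpu_m >= pod.cpu_m
            and node.free_mem_mi >= pod.mem_mi
            and node.taints <= pod.tolerations)


def score(pod, node):
    """Scoring (toy rule): CPU that would remain after placement."""
    return node.free_cpu_m - pod.cpu_m


def schedule(pod, nodes):
    """Return the chosen node name, or None if no node passes filtering."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: score(pod, n))
    pod.node_name = best.name            # binding
    best.free_cpu_m -= pod.cpu_m
    best.free_mem_mi -= pod.mem_mi
    return best.name
```

In a real cluster the binding step is a write to the API server rather than a local assignment.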

kube-controller-manager¶

Runs the core reconciliation controllers in a single binary:

Node Controller: Monitors node health, responds to node failures
ReplicaSet Controller: Maintains desired replica count for pods
Deployment Controller: Manages rolling updates and rollbacks
StatefulSet Controller: Ordered creation/deletion of stateful pods
Job Controller: Manages batch job completion
EndpointSlice Controller: Populates service endpoint data
Service Account Controller: Creates default service accounts for new namespaces
Namespace Controller: Cleans up resources when namespaces are deleted

Each controller is a reconciliation loop that watches the API server for changes and drives current state toward desired state.
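
As a minimal sketch of that loop, consider a ReplicaSet-style controller that only tracks a pod count. The `reconcile` function and the pod naming scheme are hypothetical; a real controller reacts to watch events and issues API calls rather than mutating a local set.

```python
import itertools

_ids = itertools.count(1)  # source of unique pod name suffixes


def reconcile(desired, pods):
    """One pass: compare observed pods with the desired count and fix the gap."""
    pods = set(pods)
    while len(pods) < desired:           # too few: create pods
        pods.add(f"pod-{next(_ids)}")
    while len(pods) > desired:           # too many: delete surplus
        pods.remove(sorted(pods)[-1])
    return pods


pods = reconcile(3, set())               # initial rollout creates 3 pods
pods.discard(sorted(pods)[0])            # simulate one pod failing
pods = reconcile(3, pods)                # next pass restores the count
print(sorted(pods))
```

The function is idempotent: running it again when the observed state already matches the desired state changes nothing, which is the property that makes level-triggered reconciliation robust to missed events.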

CoreDNS¶

CoreDNS is a cluster add-on that runs as ordinary pods rather than as a control plane binary. It provides cluster-wide DNS resolution for services and pods:

Resolves service names to cluster IPs (e.g., my-service.my-namespace.svc.cluster.local)
Supports DNS-based service discovery for headless services
Configurable via a Corefile (custom DNS zones, forwarding, rewrites)
Deployed as a Deployment in the kube-system namespace
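
The naming scheme above can be illustrated with a short sketch. It assumes the default `cluster.local` cluster domain and the search path that pods typically receive under the default DNS policy (`<namespace>.svc.<domain>`, `svc.<domain>`, `<domain>`, with `ndots:5`); the function names are illustrative.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def lookup_candidates(name, pod_namespace, cluster_domain="cluster.local", ndots=5):
    """Names a resolver would try, in order, for a lookup made inside a pod."""
    search = [
        f"{pod_namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}" for suffix in search]
    # Names with fewer dots than ndots go through the search path first.
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


print(service_fqdn("my-service", "my-namespace"))
print(lookup_candidates("my-service", "my-namespace"))
```

This is why a pod can reach a Service in its own namespace by short name, and one in another namespace as `my-service.other`.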

Node Components¶

kubelet¶

The kubelet runs on every node and is responsible for pod lifecycle:

Watches the API server for pods assigned to its node
Pulls container images via the CRI runtime
Reports node and pod status back to the API server
Executes liveness, readiness, and startup probes
Mounts volumes (via CSI drivers) and sets up pod networking through the container runtime, which invokes the CNI plugin
Listens on port 10250 for API server health checks and exec/log requests

kube-proxy¶

kube-proxy maintains network rules on every node for service routing:

Implements the Service abstraction by managing iptables or IPVS rules
Routes traffic from service ClusterIP to backend pod IPs
Supports iptables (the default on Linux) and IPVS (high scale) modes; newer releases add an nftables mode, and the legacy userspace mode was removed in v1.26
Watches the API server for changes to services and endpoints
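
In iptables mode, backend selection is done with an ordered chain of rules, each matching with a fixed probability. The sketch below works through why probabilities of 1/n, 1/(n-1), ..., 1 give every backend an equal share. It models only the arithmetic, not real iptables rules, and the function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Match probability of each rule, in evaluation order, for n backends."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n):
    """Overall fraction of connections each backend ends up receiving."""
    shares = []
    remaining = Fraction(1)              # traffic not yet claimed by a rule
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


print(rule_probabilities(3))
print(effective_share(3))
```

The last rule always matches, so no connection falls through the chain unassigned.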

CRI Runtime (containerd / CRI-O)¶

The container runtime interface (CRI) decouples kubelet from specific runtimes:

containerd: Most common runtime; CNCF graduated project
CRI-O: Lightweight runtime optimized for Kubernetes
Both implement the CRI API (gRPC) that kubelet calls
Container image pulling, container creation, and execution are delegated here
Dockershim was removed in Kubernetes v1.24

Container Network Interface (CNI)¶

CNI plugins handle pod network configuration:

Each pod gets its own IP address
Plugins: Calico, Cilium, Flannel, Weave, AWS VPC CNI, GKE Dataplane V2
Responsible for pod-to-pod connectivity within and across nodes
NetworkPolicy enforcement is plugin-dependent

Container Storage Interface (CSI)¶

CSI provides a standard interface for exposing storage to containers:

Replaces in-tree volume plugins (the core CSI migration feature reached GA in v1.25; individual plugins migrate on their own schedules)
Third-party drivers: AWS EBS CSI, GCE PD CSI, Azure Disk CSI, Ceph CSI
Handles volume provisioning, attachment, mounting, snapshots, and cloning

Container Runtime Interface (CRI)¶

The CRI is the gRPC API between kubelet and the container runtime:

Defines services: RuntimeService (container lifecycle) and ImageService (image management)
Enables pluggable runtimes without modifying kubelet source code
Supports both container runtimes and sandbox runtimes (Kata, gVisor)

Request Flow: Pod Creation¶

sequenceDiagram
    actor User
    participant API as kube-apiserver
    participant etcd as etcd
    participant CM as controller-manager
    participant Sched as kube-scheduler
    participant Kubelet as kubelet
    participant CRI as CRI runtime

    User->>API: kubectl apply -f pod.yaml
    API->>API: authenticate + authorize (RBAC)
    API->>API: run admission controllers (mutating, validating)
    API->>etcd: persist pod spec (nodeName="")
    etcd-->>API: confirmed
    API-->>User: accepted

    CM->>API: watch for new pods
    Note over CM: ReplicaSet controller ensures desired count

    Sched->>API: watch for unassigned pods
    API->>Sched: notify pod with nodeName=""
    Sched->>Sched: filter + score nodes
    Sched->>API: bind pod to node (set nodeName)
    API->>etcd: persist binding

    Kubelet->>API: watch for pods on its node
    API->>Kubelet: pod assigned to this node
    Kubelet->>CRI: pull image + create container
    CRI-->>Kubelet: container started
    Kubelet->>API: update pod status (Running)
    API->>etcd: persist status

Key Ports¶

Component	Port	Protocol	Purpose
kube-apiserver	6443	TCP	Kubernetes API
etcd	2379-2380	TCP	Client and peer communication
kubelet	10250	TCP	API server health, exec, logs
kube-scheduler	10259	TCP	Health probes
kube-controller-manager	10257	TCP	Health probes
kube-proxy	10256	TCP	Health probes

How It Works¶

Desired-state reconciliation, pod lifecycle, scheduling, networking, and storage internals.

Desired-State Reconciliation Loop¶

sequenceDiagram
    participant User as User / CI
    participant API as kube-apiserver
    participant ETCD as etcd
    participant Ctrl as Controller Manager
    participant Sched as Scheduler
    participant KL as kubelet (Node)
    participant CRI as containerd

    User->>API: kubectl apply -f deployment.yaml
    API->>ETCD: Store desired state
    Ctrl->>API: Watch: new Deployment
    Ctrl->>API: Create ReplicaSet
    Ctrl->>API: Create Pod specs
    Sched->>API: Watch: unscheduled Pods
    Sched->>Sched: Score nodes (resources, affinity, taints)
    Sched->>API: Bind Pod → Node
    KL->>API: Watch: Pod assigned to my node
    KL->>CRI: Create container sandbox
    CRI->>CRI: Pull image, start containers
    KL->>API: Update Pod status: Running

    loop Reconciliation
        Ctrl->>API: Watch: actual vs desired state
        Ctrl->>Ctrl: If replicas < desired → create more Pods
        Ctrl->>Ctrl: If replicas > desired → delete surplus
    end

Pod Lifecycle¶

stateDiagram-v2
    [*] --> Pending: Pod created
    Pending --> Running: Init containers done, container started
    Running --> Succeeded: All containers exit 0
    Running --> Failed: Container exits non-zero
    Running --> Unknown: Node unreachable
    Failed --> [*]: Not restarted
    Succeeded --> [*]: Job complete
    Unknown --> Running: Node recovers
    Unknown --> Failed: Node dead (grace period)

    state Running {
        [*] --> NotReady: Containers starting
        NotReady --> Ready: Readiness probe passes
        Ready --> NotReady: Readiness probe fails
    }
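
The phase transitions in the diagram can be encoded as a lookup table. The sketch below includes only the top-level edges drawn above, so it is a simplification (for example, it omits a pod failing before it ever runs); the names are illustrative.

```python
# Top-level pod phases and the transitions drawn in the diagram.
TRANSITIONS = {
    "Pending": {"Running"},
    "Running": {"Succeeded", "Failed", "Unknown"},
    "Unknown": {"Running", "Failed"},
    "Succeeded": set(),
    "Failed": set(),
}


def can_transition(current, target):
    """True if the diagram has an edge from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(phase):
    """Terminal phases have no outgoing edges."""
    return not TRANSITIONS[phase]


print(can_transition("Pending", "Running"))
print(is_terminal("Succeeded"))
```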

Networking Model¶

The 4 Networking Rules¶

Pod-to-Pod: Every Pod gets its own IP. All Pods can communicate without NAT.
Pod-to-Service: Services provide stable virtual IPs (ClusterIP) backed by iptables/IPVS rules.
External-to-Service: LoadBalancer, NodePort, or Ingress/Gateway API expose services.
Pod-to-External: Pods can reach external networks via SNAT.

flowchart TB
    subgraph Cluster["Kubernetes Cluster"]
        subgraph Node1["Node 1"]
            P1["Pod A\n10.244.1.2"]
            P2["Pod B\n10.244.1.3"]
            KP1["kube-proxy\n(iptables/IPVS)"]
        end

        subgraph Node2["Node 2"]
            P3["Pod C\n10.244.2.2"]
            P4["Pod D\n10.244.2.3"]
            KP2["kube-proxy"]
        end

        SVC["Service: my-svc\nClusterIP: 10.96.0.10\n→ Pod A, Pod C"]

        CNI["CNI Plugin\n(Calico/Cilium/Flannel)\nPod-to-Pod routing"]
    end

    External["External\nTraffic"] -->|"LoadBalancer /\nNodePort"| SVC
    SVC -->|"iptables rules"| P1
    SVC -->|"iptables rules"| P3
    P1 <-->|"CNI overlay"| P3
    P2 <-->|"CNI overlay"| P4

    style Cluster fill:#326ce5,color:#fff
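
The pod addresses in the diagram are consistent with each node owning its own slice of a cluster-wide pod range. The sketch below assumes a `10.244.0.0/16` cluster CIDR split into one `/24` per node, handed out in order, with a control plane node taking the first block. Those values are inferred from the example addresses, not stated by the source, and actual allocation depends on the CNI plugin.

```python
import ipaddress


def node_pod_cidrs(cluster_cidr, node_prefix, nodes):
    """Assign each node the next free subnet of the cluster pod range."""
    subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)
    return {node: next(subnets) for node in nodes}


def owner_of(pod_ip, cidrs):
    """Name of the node whose pod range contains the address, or None."""
    addr = ipaddress.ip_address(pod_ip)
    for node, net in cidrs.items():
        if addr in net:
            return node
    return None


cidrs = node_pod_cidrs("10.244.0.0/16", 24, ["control-plane", "node-1", "node-2"])
print(cidrs["node-1"], owner_of("10.244.1.2", cidrs))
```

Because each pod IP falls in exactly one node's range, routing a packet between pods reduces to routing it to the owning node, which is the part the CNI plugin provides.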

Storage Architecture¶

flowchart LR
    Pod["Pod"] --> PVC["PersistentVolumeClaim\n(request: 10Gi)"]
    PVC --> PV["PersistentVolume\n(10Gi, RWO)"]
    PV --> SC["StorageClass\n(provisioner: ebs.csi)"]
    SC --> CSI["CSI Driver\n(EBS, GCE PD, Ceph)"]
    CSI --> Disk["Cloud Disk\nor Storage"]

    style PVC fill:#326ce5,color:#fff
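
The claim-to-volume matching at the left of the diagram can be sketched as follows. This toy model covers only static binding against pre-existing volumes: the `PV` and `PVC` classes are hypothetical, and the rule of picking the smallest volume that fits is an assumption for the sketch. With dynamic provisioning, the CSI driver named by the StorageClass would instead create a volume when nothing matches.

```python
from dataclasses import dataclass


@dataclass
class PV:
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str
    claimed_by: str = ""                 # empty while the volume is available


@dataclass
class PVC:
    name: str
    request_gi: int
    access_mode: str
    storage_class: str


def bind(pvc, volumes):
    """Bind the claim to the smallest available volume that satisfies it."""
    fits = [
        pv for pv in volumes
        if not pv.claimed_by
        and pv.storage_class == pvc.storage_class
        and pvc.access_mode in pv.access_modes
        and pv.capacity_gi >= pvc.request_gi
    ]
    if not fits:
        return None                      # claim stays pending
    chosen = min(fits, key=lambda pv: pv.capacity_gi)
    chosen.claimed_by = pvc.name
    return chosen
```

A volume binds to at most one claim, so a second claim with the same request has to settle for a different volume or wait.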

Scheduling Algorithm¶

Phase	Operation
Filtering	Eliminate nodes that don't meet Pod requirements (resources, taints, affinity)
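Scoring	Rank remaining nodes with scoring plugins such as NodeResourcesFit (LeastAllocated strategy), NodeResourcesBalancedAllocation, NodeAffinity, and PodTopologySpread; older releases called the first three LeastRequestedPriority, BalancedResourceAllocation, and NodeAffinityPriority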
Binding	Assign Pod to highest-scoring node
Preemption	If no node passes filtering, evict lower-priority Pods to make room for the pending Pod

Benchmarks¶

Scope

Kubernetes scalability limits, API server performance, etcd throughput, and scheduling benchmarks.

Official Scalability Targets (SIG Scalability)¶

Kubernetes officially tests and targets these limits per cluster:

Dimension	Target	Notes
Nodes	5,000	Tested by SIG Scalability
Pods	150,000	30 pods/node avg
Pods per node	110	Kubelet default `maxPods`
Services	10,000
Endpoints per Service	5,000	Beyond this, use EndpointSlices
Namespaces	10,000
ConfigMaps	30,000
Secrets	30,000
Total containers	300,000	About 2 containers per pod on average
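
A planned cluster shape can be checked against a few of these targets with simple arithmetic. The sketch below copies four values from the table; the `LIMITS` dictionary and `exceeded` function are illustrative names.

```python
# Selected targets copied from the table above.
LIMITS = {
    "nodes": 5_000,
    "pods": 150_000,
    "pods_per_node": 110,
    "services": 10_000,
}


def exceeded(nodes, pods, services=0):
    """Return the names of the targets that a planned cluster shape exceeds."""
    observed = {
        "nodes": nodes,
        "pods": pods,
        "pods_per_node": pods / nodes if nodes else 0,
        "services": services,
    }
    return sorted(k for k, v in observed.items() if v > LIMITS[k])


print(exceeded(5_000, 150_000))          # at the targets, nothing exceeded
print(exceeded(100, 12_000))             # 120 pods per node on average
```

The per-node and total targets cannot both be saturated: 5,000 nodes at 110 pods each would be 550,000 pods, well past the 150,000 total, which is why the table quotes an average of 30 pods per node.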

API Server Performance¶

Metric	Target SLO	Notes
API request latency (mutating, P99)	< 1s	At 5000-node scale
API request latency (non-mutating, P99)	< 5s	For resource-list calls
API request latency (P50)	< 100ms	Typical operations
Pod startup latency (P99)	< 5s	Stateless pods, from creation to containers started, excluding image pull and init containers

etcd Performance¶

Cluster Size	WAL fsync P99	Read latency P99	Write QPS	Storage
< 100 nodes	< 5ms	< 10ms	1,000	2Gi
100-500 nodes	< 10ms	< 25ms	5,000	4Gi
500-5000 nodes	< 10ms	< 50ms	10,000	8Gi

Disk Latency is Critical

etcd persists every write with a sequential append followed by an fsync. Sustained fsync latency above roughly 10ms can delay heartbeats enough to trigger leader elections, which in turn causes cluster instability and cascading failures.
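
To judge a disk against the 10ms figure, the tail of the latency distribution matters more than the average. The sketch below computes a nearest-rank P99 over a list of fsync samples in milliseconds. The sample values are made up for illustration; in practice the data would come from a disk benchmark or from etcd's `etcd_disk_wal_fsync_duration_seconds` metric.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fsync_ok(samples_ms, threshold_ms=10.0):
    """True if the P99 fsync latency is under the threshold."""
    return percentile(samples_ms, 99) < threshold_ms


healthy = [2.0] * 98 + [4.0, 50.0]       # a single outlier in 100 samples
degraded = [2.0] * 97 + [12.0, 12.0, 50.0]
print(fsync_ok(healthy), fsync_ok(degraded))
```

A single slow sync in a hundred does not move the P99, but a few of them do, even when the mean still looks healthy.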

Scheduling Performance¶

Scheduler Metric	Value	Conditions
Scheduling throughput	~100 pods/sec	Default scheduler, 5000-node cluster
Scheduling latency (P99)	< 100ms	Without complex affinity rules
Scheduling with affinity	20-50 pods/sec	Pod anti-affinity across nodes
Preemption overhead	+50-100ms	When preemption kicks in
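
These throughput figures translate directly into how long a large rollout spends in scheduling. The sketch below is back-of-the-envelope arithmetic using the table's values, which are estimates rather than measured results.

```python
def time_to_schedule(pods, pods_per_sec):
    """Seconds needed to schedule a batch of pods at a steady throughput."""
    return pods / pods_per_sec


full_cluster = 150_000                   # pod target for a 5,000-node cluster
print(time_to_schedule(full_cluster, 100) / 60)   # default scheduler, minutes
print(time_to_schedule(full_cluster, 20) / 60)    # heavy anti-affinity, minutes
```

At roughly 100 pods/sec, filling a cluster to the 150,000-pod target takes about 25 minutes of scheduling time; at 20 pods/sec with anti-affinity rules it takes over two hours.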

Network Performance (CNI Comparison)¶

CNI	Pod-to-Pod Latency	Throughput (TCP)	Throughput (eBPF)	Encryption Overhead
Cilium	~50us	9.5 Gbps	9.8 Gbps (native)	15-20% (WireGuard)
Calico	~60us	9.2 Gbps	9.5 Gbps (eBPF mode)	20-25% (WireGuard)
Flannel (VXLAN)	~80us	8.5 Gbps	N/A	N/A (no native)
Host networking	~30us	10 Gbps	N/A	N/A

Real-World Scale References¶

Google GKE: Supports 15,000 nodes per cluster (managed)
AWS EKS: Up to 5,000 nodes with managed control plane
OpenAI: Runs 7,500-node clusters for ML training
Alibaba Cloud: Reported testing at 10,000+ nodes

Sourcing Status¶

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.