Architecture

Kubernetes follows a declarative, state-driven architecture. A cluster consists of a control plane (one or more machines running the management components) and worker nodes (machines running user workloads). All state is stored in etcd, and every component communicates through the API server.

See also: infrastructure/kubernetes/index, infrastructure/kubernetes/architecture, infrastructure/kubernetes/operations, infrastructure/kubernetes/security

Cluster Architecture Overview

graph TD
    subgraph ControlPlane["Control Plane"]
        API["kube-apiserver<br/>:6443"]
        ETCD["etcd<br/>:2379-2380"]
        SCHED["kube-scheduler<br/>:10259"]
        CM["kube-controller-manager<br/>:10257"]
        DNS["CoreDNS"]
    end

    subgraph Node1["Worker Node 1"]
        KUBELET1["kubelet<br/>:10250"]
        PROXY1["kube-proxy"]
        CRI1["CRI Runtime<br/>(containerd/CRI-O)"]
        POD1A["Pod A"]
        POD1B["Pod B"]
    end

    subgraph Node2["Worker Node 2"]
        KUBELET2["kubelet<br/>:10250"]
        PROXY2["kube-proxy"]
        CRI2["CRI Runtime<br/>(containerd/CRI-O)"]
        POD2A["Pod C"]
        POD2B["Pod D"]
    end

    API <--> ETCD
    SCHED --> API
    CM --> API
    DNS --> API
    KUBELET1 --> API
    PROXY1 --> API
    KUBELET2 --> API
    PROXY2 --> API
    KUBELET1 --> CRI1
    CRI1 --> POD1A
    CRI1 --> POD1B
    KUBELET2 --> CRI2
    CRI2 --> POD2A
    CRI2 --> POD2B

Control Plane Components

kube-apiserver

The API server is the front end for the Kubernetes control plane. It exposes the Kubernetes HTTP API on port 6443 and is the only component that communicates directly with etcd.

  • Validates and stores all API objects (pods, services, deployments, etc.)
  • Serves the REST API consumed by kubectl, controllers, and external tools
  • Handles authentication, authorization (RBAC), and admission controllers
  • Scales horizontally by running multiple replicas behind a load balancer
  • All other control plane components watch the API server, not etcd directly

etcd

etcd is a distributed, consistent key-value store that holds all cluster state:

  • Stores every object definition, cluster config, and dynamic state
  • Uses the Raft consensus algorithm for HA (typically 3 or 5 members, with a majority of members forming the quorum)
  • Listens on TCP 2379 (client) and 2380 (peer communication)
  • Only the API server communicates with etcd directly
  • Backups are critical; use etcdctl snapshot save regularly

etcd Quorum

An etcd cluster requires a majority (quorum) to accept writes. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Never run production etcd with fewer than 3 members.
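
The quorum arithmetic above can be sketched in a few lines. This is an illustrative helper, not part of etcd; the function names `quorum_size` and `failure_tolerance` are hypothetical.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can fail while a majority is still reachable."""
    return members - quorum_size(members)
```

For example, `failure_tolerance(4)` returns 1, the same as for 3 members, which is why even-sized clusters add cost without adding fault tolerance.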

kube-scheduler

The scheduler watches for newly created pods that have no assigned node and selects an appropriate node for them:

  • Filtering: Eliminates nodes that do not meet requirements (resources, taints, node selectors)
  • Scoring: Ranks remaining nodes using priority functions (resource balance, affinity, anti-affinity)
  • Binding: Writes the selected node name to the pod object via the API server
  • Supports custom schedulers and scheduler extensions
  • Listens on port 10259 for health probes
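
The filter, score, and bind steps above can be sketched as a toy model. This is not the kube-scheduler implementation: the `Node` and `Pod` classes, the taint model (plain strings), and the scoring rule (most CPU left after placement) are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_mem_mib: int
    taints: Set[str] = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_millis: int
    mem_mib: int
    tolerations: Set[str] = field(default_factory=set)
    node_name: Optional[str] = None  # empty until the pod is bound


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering: the node must have room and every taint must be tolerated."""
    fits = pod.cpu_millis <= node.free_cpu_millis and pod.mem_mib <= node.free_mem_mib
    return fits and node.taints <= pod.tolerations


def score(pod: Pod, node: Node) -> int:
    """Scoring (assumed rule): prefer the node with the most CPU left over."""
    return node.free_cpu_millis - pod.cpu_millis


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Return the chosen node name, or None if no node passes filtering."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: score(pod, n))
    pod.node_name = best.name  # binding: record the decision on the pod
    return best.name
```

In a real cluster the binding is written through the API server rather than set on a local object, and scoring combines several weighted plugins.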

kube-controller-manager

Runs the core reconciliation controllers in a single binary:

  • Node Controller: Monitors node health, responds to node failures
  • ReplicaSet Controller: Maintains desired replica count for pods
  • Deployment Controller: Manages rolling updates and rollbacks
  • StatefulSet Controller: Ordered creation/deletion of stateful pods
  • Job Controller: Manages batch job completion
  • EndpointSlice Controller: Populates service endpoint data
  • Service Account Controller: Creates default service accounts for new namespaces
  • Namespace Controller: Cleans up resources when namespaces are deleted

Each controller is a reconciliation loop that watches the API server for changes and drives current state toward desired state.
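
A reconciliation step of this kind can be sketched as a pure function that compares desired and observed state and returns the actions needed to converge. The function name and the shape of the returned dictionary are hypothetical; real controllers act through the API server and choose deletion victims with more elaborate rules.

```python
from typing import Dict, List, Union


def reconcile(desired_replicas: int, observed_pods: List[str]) -> Dict[str, Union[int, List[str]]]:
    """Return how many pods to create and which pods to delete."""
    surplus = len(observed_pods) - desired_replicas
    if surplus > 0:
        # Too many pods: pick the surplus from the end of the sorted names.
        return {"create": 0, "delete": sorted(observed_pods)[-surplus:]}
    # Too few (or exactly enough) pods: create the shortfall, delete nothing.
    return {"create": -surplus, "delete": []}
```

Calling `reconcile` repeatedly after applying its actions reaches a fixed point where it returns `{"create": 0, "delete": []}`, which is the steady state a controller aims for.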

CoreDNS

CoreDNS is a cluster add-on rather than a control plane process in the strict sense; it provides cluster-wide DNS resolution for services and pods:

  • Resolves service names to cluster IPs (e.g., my-service.my-namespace.svc.cluster.local)
  • Supports DNS-based service discovery for headless services
  • Configurable via a Corefile (custom DNS zones, forwarding, rewrites)
  • Deployed as a Deployment in the kube-system namespace
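
The naming scheme above can be illustrated with a small helper that builds a service's fully qualified name and the candidate names a resolver would try for a short name. The helper names are hypothetical, and the search list shown assumes the usual pod `resolv.conf` layout with the default `cluster.local` domain.

```python
from typing import List


def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def search_candidates(name: str, pod_namespace: str,
                      cluster_domain: str = "cluster.local") -> List[str]:
    """Names tried, in order, when a pod looks up a short name."""
    search = [
        f"{pod_namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    return [f"{name}.{suffix}" for suffix in search]
```

A pod in `my-namespace` can therefore reach `my-service` by its short name, while a service in another namespace needs at least the `service.namespace` form.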

Node Components

kubelet

The kubelet runs on every node and is responsible for pod lifecycle:

  • Watches the API server for pods assigned to its node
  • Pulls container images via the CRI runtime
  • Reports node and pod status back to the API server
  • Executes liveness, readiness, and startup probes
  • Mounts volumes (CSI) and configures pod networking (CNI)
  • Serves the kubelet API on port 10250, which the API server calls for exec, log, and port-forward requests and which also exposes health and metrics endpoints

kube-proxy

kube-proxy maintains network rules on every node for service routing:

  • Implements the Service abstraction by managing iptables or IPVS rules
  • Routes traffic from service ClusterIP to backend pod IPs
  • Supports several modes on Linux: iptables (default), IPVS (high scale), and nftables (added in newer releases); the legacy userspace mode was removed in v1.26
  • Watches the API server for changes to services and endpoints
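
In iptables mode, kube-proxy spreads connections across backends with a chain of random-match rules: with n backends, rule i matches with probability 1/(n - i) and the last rule always matches. The sketch below checks that this yields an even split; the function names are hypothetical and the code models only the arithmetic, not the rules themselves.

```python
from fractions import Fraction
from typing import List


def rule_probabilities(backends: int) -> List[Fraction]:
    """Per-rule match probability for a chain of `backends` rules."""
    return [Fraction(1, backends - i) for i in range(backends)]


def effective_shares(backends: int) -> List[Fraction]:
    """Overall share of new connections that reaches each backend."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probabilities(backends):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares
```

With three backends the per-rule probabilities are 1/3, 1/2, and 1, and each backend ends up with exactly one third of new connections.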

CRI Runtime (containerd / CRI-O)

The container runtime interface (CRI) decouples kubelet from specific runtimes:

  • containerd: Most common runtime; CNCF graduated project
  • CRI-O: Lightweight runtime optimized for Kubernetes
  • Both implement the CRI API (gRPC) that kubelet calls
  • Container image pulling, container creation, and execution are delegated here
  • Dockershim was removed in Kubernetes v1.24

Container Network Interface (CNI)

CNI plugins handle pod network configuration:

  • Each pod gets its own IP address
  • Plugins: Calico, Cilium, Flannel, Weave, AWS VPC CNI, GKE Dataplane V2
  • Responsible for pod-to-pod connectivity within and across nodes
  • NetworkPolicy enforcement is plugin-dependent

Container Storage Interface (CSI)

CSI provides a standard interface for exposing storage to containers:

  • Replaces in-tree volume plugins (the core CSI migration feature reached GA in v1.25; individual plugins migrate on their own schedules)
  • Third-party drivers: AWS EBS CSI, GCE PD CSI, Azure Disk CSI, Ceph CSI
  • Handles volume provisioning, attachment, mounting, snapshots, and cloning

Container Runtime Interface (CRI)

The CRI is the gRPC API between kubelet and the container runtime:

  • Defines services: RuntimeService (container lifecycle) and ImageService (image management)
  • Enables pluggable runtimes without modifying kubelet source code
  • Supports both container runtimes and sandbox runtimes (Kata, gVisor)

Request Flow: Pod Creation

sequenceDiagram
    actor User
    participant API as kube-apiserver
    participant etcd as etcd
    participant CM as controller-manager
    participant Sched as kube-scheduler
    participant Kubelet as kubelet
    participant CRI as CRI runtime

    User->>API: kubectl apply -f pod.yaml
    API->>API: authenticate + authorize (RBAC)
    API->>API: run admission controllers (mutating, validating)
    API->>etcd: persist pod spec (nodeName="")
    etcd-->>API: confirmed
    API-->>User: accepted

    CM->>API: watch for new pods
    Note over CM: ReplicaSet controller ensures desired count

    Sched->>API: watch for unassigned pods
    API->>Sched: notify pod with nodeName=""
    Sched->>Sched: filter + score nodes
    Sched->>API: bind pod to node (set nodeName)
    API->>etcd: persist binding

    Kubelet->>API: watch for pods on its node
    API->>Kubelet: pod assigned to this node
    Kubelet->>CRI: pull image + create container
    CRI-->>Kubelet: container started
    Kubelet->>API: update pod status (Running)
    API->>etcd: persist status

Key Ports

| Component | Port | Protocol | Purpose |
|---|---|---|---|
| kube-apiserver | 6443 | TCP | Kubernetes API |
| etcd | 2379-2380 | TCP | Client and peer communication |
| kubelet | 10250 | TCP | API server health, exec, logs |
| kube-scheduler | 10259 | TCP | Health probes |
| kube-controller-manager | 10257 | TCP | Health probes |
| kube-proxy | 10256 | TCP | Health probes |
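
When debugging connectivity between components, a quick TCP reachability check against these ports can help. The sketch below uses only the standard library; the `DEFAULT_PORTS` mapping restates the ports listed above, and the helper names are hypothetical. A successful TCP connect says nothing about TLS or authorization.

```python
import socket
from typing import Dict

DEFAULT_PORTS = {
    "kube-apiserver": 6443,
    "etcd-client": 2379,
    "etcd-peer": 2380,
    "kubelet": 10250,
    "kube-scheduler": 10259,
    "kube-controller-manager": 10257,
    "kube-proxy": 10256,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_host(host: str, ports: Dict[str, int] = DEFAULT_PORTS,
               timeout: float = 1.0) -> Dict[str, bool]:
    """Probe each named port on a host and report which ones accept connections."""
    return {name: is_port_open(host, port, timeout) for name, port in ports.items()}
```

Many of these ports are bound to localhost or firewalled by default, so a closed result from a remote machine is often expected.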

How It Works

Desired-state reconciliation, pod lifecycle, scheduling, networking, and storage internals.

Desired-State Reconciliation Loop

sequenceDiagram
    participant User as User / CI
    participant API as kube-apiserver
    participant ETCD as etcd
    participant Ctrl as Controller Manager
    participant Sched as Scheduler
    participant KL as kubelet (Node)
    participant CRI as containerd

    User->>API: kubectl apply -f deployment.yaml
    API->>ETCD: Store desired state
    Ctrl->>API: Watch: new Deployment
    Ctrl->>API: Create ReplicaSet
    Ctrl->>API: Create Pod specs
    Sched->>API: Watch: unscheduled Pods
    Sched->>Sched: Score nodes (resources, affinity, taints)
    Sched->>API: Bind Pod → Node
    KL->>API: Watch: Pod assigned to my node
    KL->>CRI: Create container sandbox
    CRI->>CRI: Pull image, start containers
    KL->>API: Update Pod status: Running

    loop Reconciliation
        Ctrl->>API: Watch: actual vs desired state
        Ctrl->>Ctrl: If replicas < desired → create more Pods
        Ctrl->>Ctrl: If replicas > desired → delete surplus
    end

Pod Lifecycle

stateDiagram-v2
    [*] --> Pending: Pod accepted by the API server
    Pending --> Running: Init containers done and a container started
    Running --> Succeeded: All containers exit 0 (no restart)
    Running --> Failed: All containers terminated and one exited non-zero (no restart)
    Running --> Unknown: Node unreachable
    Failed --> [*]: Not restarted
    Succeeded --> [*]: Job complete
    Unknown --> Running: Node recovers
    Unknown --> Failed: Node dead (grace period)

    state Running {
        [*] --> NotReady: Containers starting
        NotReady --> Ready: Readiness probe passes
        Ready --> NotReady: Readiness probe fails
    }
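
The phase transitions can also be read as a function of container states. The sketch below is a deliberately simplified model: it takes each container as a `(state, exit_code)` pair with states `waiting`, `running`, or `terminated`, ignores init containers and the `Unknown` phase, and treats a container that will be restarted as keeping the pod in `Running`. The function name and input shape are assumptions, not the kubelet's logic.

```python
from typing import List, Optional, Tuple

ContainerState = Tuple[str, Optional[int]]  # (state, exit_code or None)


def pod_phase(containers: List[ContainerState], restart_policy: str = "Never") -> str:
    """Derive a pod phase from container states under the stated simplifications."""
    states = [state for state, _ in containers]
    if not containers or "waiting" in states:
        return "Pending"  # something has not started yet
    if "running" in states:
        return "Running"
    # From here on, every container has terminated.
    if restart_policy == "Always":
        return "Running"  # containers will be restarted
    if all(code == 0 for _, code in containers):
        return "Succeeded"
    if restart_policy == "OnFailure":
        return "Running"  # failed containers will be restarted
    return "Failed"
```

Readiness is tracked separately from the phase: a pod can be `Running` while its readiness probe fails, in which case it is removed from service endpoints but not restarted.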

Networking Model

The 4 Networking Rules

  1. Pod-to-Pod: Every Pod gets its own IP. All Pods can communicate without NAT.
  2. Pod-to-Service: Services provide stable virtual IPs (ClusterIP) backed by iptables/IPVS rules.
  3. External-to-Service: LoadBalancer, NodePort, or Ingress/Gateway API expose services.
  4. Pod-to-External: Pods can reach external networks via SNAT.

flowchart TB
    subgraph Cluster["Kubernetes Cluster"]
        subgraph Node1["Node 1"]
            P1["Pod A\n10.244.1.2"]
            P2["Pod B\n10.244.1.3"]
            KP1["kube-proxy\n(iptables/IPVS)"]
        end

        subgraph Node2["Node 2"]
            P3["Pod C\n10.244.2.2"]
            P4["Pod D\n10.244.2.3"]
            KP2["kube-proxy"]
        end

        SVC["Service: my-svc\nClusterIP: 10.96.0.10\n→ Pod A, Pod C"]

        CNI["CNI Plugin\n(Calico/Cilium/Flannel)\nPod-to-Pod routing"]
    end

    External["External\nTraffic"] -->|"LoadBalancer /\nNodePort"| SVC
    SVC -->|"iptables rules"| P1
    SVC -->|"iptables rules"| P3
    P1 <-->|"CNI overlay"| P3
    P2 <-->|"CNI overlay"| P4

    style Cluster fill:#326ce5,color:#fff

Storage Architecture

flowchart LR
    Pod["Pod"] --> PVC["PersistentVolumeClaim\n(request: 10Gi)"]
    PVC --> PV["PersistentVolume\n(10Gi, RWO)"]
    PV --> SC["StorageClass\n(provisioner: ebs.csi)"]
    SC --> CSI["CSI Driver\n(EBS, GCE PD, Ceph)"]
    CSI --> Disk["Cloud Disk\nor Storage"]

    style PVC fill:#326ce5,color:#fff
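
The claim-to-volume step in this chain can be sketched as a matching function. This is a toy model of static binding only: the class names, the fields, and the "smallest volume that fits" rule are assumptions for illustration, and dynamic provisioning through a StorageClass and CSI driver is reduced to the `None` return value.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class PersistentVolume:
    name: str
    capacity_gi: int
    access_modes: FrozenSet[str]
    storage_class: str
    claimed_by: Optional[str] = None


@dataclass
class Claim:
    name: str
    request_gi: int
    access_mode: str
    storage_class: str


def bind_claim(claim: Claim, volumes: List[PersistentVolume]) -> Optional[PersistentVolume]:
    """Bind the claim to the smallest unclaimed volume that satisfies it."""
    candidates = [
        pv for pv in volumes
        if pv.claimed_by is None
        and pv.storage_class == claim.storage_class
        and claim.access_mode in pv.access_modes
        and pv.capacity_gi >= claim.request_gi
    ]
    if not candidates:
        return None  # a provisioner would be asked to create a volume here
    best = min(candidates, key=lambda pv: pv.capacity_gi)
    best.claimed_by = claim.name
    return best
```

Note that a claim may end up with more capacity than it requested, since a volume is bound whole.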

Scheduling Algorithm

| Phase | Operation |
|---|---|
| Filtering | Eliminate nodes that don't meet Pod requirements (resources, taints, affinity) |
| Scoring | Rank remaining nodes with scoring plugins such as NodeResourcesFit (LeastAllocated strategy, formerly LeastRequestedPriority), NodeResourcesBalancedAllocation, NodeAffinity, and PodTopologySpread |
| Binding | Assign Pod to highest-scoring node |
| Preemption | If no node passes filtering, evict lower-priority Pods to make room |

Benchmarks

Scope

Kubernetes scalability limits, API server performance, etcd throughput, and scheduling benchmarks.

Official Scalability Targets (SIG Scalability)

Kubernetes officially tests and targets these limits per cluster:

| Dimension | Target | Notes |
|---|---|---|
| Nodes | 5,000 | Tested by SIG Scalability |
| Pods | 150,000 | 30 pods/node avg |
| Pods per node | 110 | Kubelet default maxPods |
| Services | 10,000 | |
| Endpoints per Service | 5,000 | Beyond this, use EndpointSlices |
| Namespaces | 10,000 | |
| ConfigMaps | 30,000 | |
| Secrets | 30,000 | |
| Total API objects | ~300,000 | etcd storage limit |
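
A capacity plan can be checked against these targets with simple arithmetic. The sketch below copies a few of the figures listed above into a dictionary; the names `TARGETS`, `over_target`, and `average_pods_per_node` are hypothetical, and the targets interact in practice, so passing this check is not a guarantee of a healthy cluster.

```python
from typing import Dict, Tuple

TARGETS = {
    "nodes": 5_000,
    "pods": 150_000,
    "pods_per_node": 110,
    "services": 10_000,
    "namespaces": 10_000,
}


def average_pods_per_node(pods: int, nodes: int) -> float:
    """Average pod density implied by a pod and node count."""
    return pods / nodes


def over_target(plan: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return each planned dimension that exceeds its target as (planned, target)."""
    return {
        dim: (value, TARGETS[dim])
        for dim, value in plan.items()
        if dim in TARGETS and value > TARGETS[dim]
    }
```

The 30 pods/node average in the table follows from `average_pods_per_node(150_000, 5_000)`; it is well below the 110 per-node ceiling, so a cluster cannot run every node at maximum density and stay within the total pod target.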

API Server Performance

| Metric | Target SLO | Notes |
|---|---|---|
| API request latency (mutating, P99) | < 1s | At 5000-node scale |
| API request latency (non-mutating, P99) | < 5s | For resource-list calls |
| API request latency (P50) | < 100ms | Typical operations |
| Startup latency (P99) | < 5s | Pod ready from API call |

etcd Performance

| Cluster Size | WAL fsync P99 | Read latency P99 | Write QPS | Storage |
|---|---|---|---|---|
| < 100 nodes | < 5ms | < 10ms | 1,000 | 2Gi |
| 100-500 nodes | < 10ms | < 25ms | 5,000 | 4Gi |
| 500-5000 nodes | < 10ms | < 50ms | 10,000 | 8Gi |

Disk Latency is Critical

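etcd persists every write to its write-ahead log (WAL) with an fsync before acknowledging it, so disk latency directly bounds write latency. Sustained WAL fsync latency above roughly 10ms can cause missed heartbeats, leader elections, cluster instability, and cascading failures.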
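
One way to apply this threshold is to compute the 99th percentile of measured fsync durations and compare it with the 10ms guideline. The sketch below uses a nearest-rank percentile over raw samples in milliseconds; the function names are hypothetical, and in practice the samples would typically come from etcd's `etcd_disk_wal_fsync_duration_seconds` histogram or a disk benchmark rather than a hand-built list.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fsync_within_guideline(samples_ms: Sequence[float], threshold_ms: float = 10.0) -> bool:
    """True if the P99 fsync latency is below the threshold."""
    return percentile(samples_ms, 99) < threshold_ms
```

A single slow outlier in 100 samples does not move the P99 under this definition, but two do, which matches the intent of alerting on sustained rather than isolated slowness.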

Scheduling Performance

| Scheduler Metric | Value | Conditions |
|---|---|---|
| Scheduling throughput | ~100 pods/sec | Default scheduler, 5000-node cluster |
| Scheduling latency (P99) | < 100ms | Without complex affinity rules |
| Scheduling with affinity | 20-50 pods/sec | Pod anti-affinity across nodes |
| Preemption overhead | +50-100ms | When preemption kicks in |

Network Performance (CNI Comparison)

| CNI | Pod-to-Pod Latency | Throughput (TCP) | Throughput (eBPF) | Encryption Overhead |
|---|---|---|---|---|
| Cilium | ~50µs | 9.5 Gbps | 9.8 Gbps (native) | 15-20% (WireGuard) |
| Calico | ~60µs | 9.2 Gbps | 9.5 Gbps (eBPF mode) | 20-25% (WireGuard) |
| Flannel (VXLAN) | ~80µs | 8.5 Gbps | N/A | N/A (no native) |
| Host networking | ~30µs | 10 Gbps | N/A | N/A |

Real-World Scale References

  • Google GKE: Supports 15,000 nodes per cluster (managed)
  • AWS EKS: Up to 5,000 nodes with managed control plane
  • OpenAI: Runs 7,500-node clusters for ML training
  • Alibaba Cloud: Reported testing at 10,000+ nodes

Sourcing Status

Unsourced Performance Data

The performance numbers in this document are estimated from vendor documentation, community benchmarks, and engineering judgment. They do not represent controlled benchmarks with documented test conditions. Specific hardware configurations, software versions, and test methodologies were not recorded.

Use these figures as rough guidance only. For production capacity planning, run your own benchmarks against your specific workload and infrastructure.
