Kubernetes

The industry-standard container orchestration platform for automating deployment, scaling, and management of containerized workloads across clusters of machines.

Overview

Kubernetes (K8s) is a production-grade container orchestration system originally developed by Google and now maintained by the Cloud Native Computing Foundation (CNCF). It implements a desired-state model where controllers continuously reconcile actual state with declared intent. Kubernetes manages the full lifecycle of containerized applications: scheduling, scaling, networking, storage, and self-healing.
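
The desired-state model described above can be illustrated with a small sketch. This is not Kubernetes code: `DeploymentState`, `reconcile`, and `run_until_converged` are hypothetical names, and a replica count stands in for the much richer object state a real controller compares.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeploymentState:
    """Hypothetical stand-in for a workload's state: just a replica count."""
    replicas: int


def reconcile(desired: DeploymentState, actual: DeploymentState) -> List[str]:
    """Return the actions needed to move actual state toward desired state."""
    if actual.replicas < desired.replicas:
        return [f"create pod {i}" for i in range(actual.replicas, desired.replicas)]
    if actual.replicas > desired.replicas:
        return [f"delete pod {i}" for i in range(desired.replicas, actual.replicas)]
    return []  # already converged, nothing to do


def run_until_converged(desired: DeploymentState, actual: DeploymentState,
                        max_iterations: int = 10) -> DeploymentState:
    """Call reconcile() repeatedly and apply its actions, as a control loop would."""
    for _ in range(max_iterations):
        actions = reconcile(desired, actual)
        if not actions:
            break
        for action in actions:
            actual.replicas += 1 if action.startswith("create") else -1
    return actual
```

The point of the sketch is that the loop never stores a plan: each pass compares declared intent with observed state and acts only on the difference, which is what makes the model self-healing.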

Repository & Community

| Attribute | Detail |
|---|---|
| Repository | github.com/kubernetes/kubernetes |
| Stars | ~115k+ ⭐ |
| Latest Stable | v1.35.3 (April 2026); v1.36 due April 22, 2026 |
| Language | Go |
| License | Apache 2.0 |
| Governance | CNCF (Linux Foundation) |
| Contributors | 9,000+ |

Evaluation

  • Why it's better: Kubernetes is cloud-agnostic and declaratively configured, with self-healing, horizontal pod autoscaling, service discovery, and rolling updates built in. It is backed by a massive ecosystem (the CNCF landscape) and is the dominant industry standard for container orchestration.

  • When it fits (Applicability):

      • Microservices at scale across multiple nodes
      • CI/CD with automated rollouts and rollbacks
      • Multi-cloud / hybrid cloud portability
      • Stateful workloads with persistent volumes
      • AI/ML training and inference pipelines
      • Edge deployments (K3s, MicroK8s)

  • Pros and Cons:

| Pros | Cons |
|---|---|
| Cloud-agnostic, runs anywhere | Steep learning curve |
| Self-healing, auto-scaling | Complex networking (CNI plugins) |
| Massive CNCF ecosystem | Control plane overhead for small workloads |
| Declarative desired-state model | YAML verbosity |
| Service mesh, Ingress, Gateway API | Security hardening requires expertise |
| GPU/DRA scheduling for AI/ML | etcd operational complexity |
| Every major cloud offers managed K8s | Not ideal for traditional VM workloads |

Architecture

```mermaid
flowchart TB
    subgraph ControlPlane["Control Plane"]
        API["kube-apiserver\n(REST + gRPC)"]
        ETCD["etcd\n(distributed KV store)"]
        Sched["kube-scheduler\n(pod placement)"]
        CM["kube-controller-manager\n(reconciliation loops)"]
        CCM["cloud-controller-manager\n(cloud API integration)"]
    end

    subgraph WorkerNode["Worker Node"]
        Kubelet["kubelet\n(pod lifecycle)"]
        KProxy["kube-proxy\n(Service networking)"]
        CRI["Container Runtime\n(containerd / CRI-O)"]
        Pods["Pods\n(application containers)"]
    end

    API <-->|"watch/list"| ETCD
    Sched -->|"bind pod"| API
    CM -->|"reconcile"| API
    CCM -->|"cloud ops"| API
    Kubelet -->|"status"| API
    API -->|"spec"| Kubelet
    Kubelet -->|"CRI"| CRI
    CRI --> Pods
    KProxy -->|"iptables/IPVS"| Pods

    style ControlPlane fill:#326ce5,color:#fff
    style WorkerNode fill:#1565c0,color:#fff
```

Key Features

| Feature | Detail |
|---|---|
| Pod Scheduling | Affinity, anti-affinity, taints, tolerations, topology spread |
| Auto-Scaling | HPA (horizontal), VPA (vertical), Cluster Autoscaler |
| Service Discovery | ClusterIP, NodePort, LoadBalancer, ExternalName |
| Ingress / Gateway API | L7 traffic routing, TLS termination |
| Storage | PV, PVC, CSI drivers, StorageClasses |
| ConfigMaps / Secrets | Externalized configuration and credentials |
| RBAC | Fine-grained role-based access control |
| Namespaces | Logical cluster partitioning |
| DRA (v1.36) | Dynamic Resource Allocation for GPUs, FPGAs |
| Custom Resources | Extend API with CRDs + Operators |
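
As an illustration of the HPA entry, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the configured bounds. The function name and the default bounds are assumptions for this example; the real controller also handles missing metrics, pod readiness, and stabilization windows, which are omitted here.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int = 1,
                         max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Simplified HPA calculation (hypothetical helper, not the controller's code)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target give a ratio of 1.5, so the sketch asks for 6 replicas.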

Key Ecosystem

| Category | Tools |
|---|---|
| Managed K8s | EKS, GKE, AKS, DOKS, OKE, Linode LKE |
| Lightweight | K3s, MicroK8s, Kind, Minikube |
| Networking | Calico, Cilium, Flannel, Antrea |
| Service Mesh | Istio, Linkerd, Consul Connect |
| GitOps | ArgoCD, Flux |
| Observability | Prometheus, Grafana, OpenTelemetry |
| Security | Falco, OPA/Gatekeeper, Trivy, Kyverno |

Pricing

| Offering | Cost | Notes |
|---|---|---|
| Self-hosted | Free (Apache 2.0) | You manage everything |
| AWS EKS | $0.10/hr/cluster + node costs | Managed control plane |
| GKE Autopilot | $0.10/hr/cluster + pod costs | Fully managed |
| Azure AKS | Free control plane + node costs | Managed |
| Enterprise | Various (Rancher, OpenShift, Tanzu) | Support + add-ons |
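
To put the hourly control-plane fee in monthly terms, here is a rough calculation. It assumes 730 hours per month (8,760 hours per year divided by 12); the node rate in the test values is a made-up placeholder, not a quoted price, and the function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months


def monthly_cost(control_plane_hourly: float, node_hourly: float,
                 node_count: int, hours: int = HOURS_PER_MONTH) -> float:
    """Estimate monthly spend: control-plane fee plus per-node compute."""
    return (control_plane_hourly + node_hourly * node_count) * hours
```

Under these assumptions, a $0.10/hr control-plane fee alone comes to roughly $73 per month before any node or pod costs.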

Compatibility

| Dimension | Support |
|---|---|
| Container runtimes | containerd (default), CRI-O |
| Node OS | Linux (primary), Windows (worker nodes) |
| CPU architecture | amd64, arm64, arm/v7, s390x, ppc64le |
| Storage | CSI (Ceph, EBS, GCE PD, Azure Disk, NFS, etc.) |
| Networking | CNI plugins (Calico, Cilium, Flannel, etc.) |
| Infrastructure | Bare metal, VMs, any cloud, edge |

Scale Limits (Upstream)

| Dimension | Limit |
|---|---|
| Nodes per cluster | 5,000 |
| Pods per node | 110 (default) |
| Pods per cluster | 150,000 |
| Services per cluster | 10,000 |
| Namespaces per cluster | 10,000 |
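
These limits interact: 5,000 nodes each running the default 110 pods would total 550,000 pods, well above the 150,000 pods-per-cluster threshold, so the per-cluster pod limit is usually reached first. The sketch below checks a planned cluster against the figures in the table. `UPSTREAM_LIMITS` and `check_cluster_plan` are hypothetical names, and the plan assumes every node runs the same number of pods.

```python
# Thresholds copied from the table above.
UPSTREAM_LIMITS = {
    "nodes": 5_000,
    "pods_per_node": 110,  # a configurable default, not a hard cap
    "pods": 150_000,
    "services": 10_000,
    "namespaces": 10_000,
}


def check_cluster_plan(nodes, pods_per_node, services=0, namespaces=0):
    """Return the names of the dimensions where the plan exceeds a threshold."""
    planned = {
        "nodes": nodes,
        "pods_per_node": pods_per_node,
        "pods": nodes * pods_per_node,  # uniform pod density is an assumption
        "services": services,
        "namespaces": namespaces,
    }
    return [name for name, value in planned.items()
            if value > UPSTREAM_LIMITS[name]]
```

A plan at the node maximum therefore has to average no more than 30 pods per node (150,000 / 5,000) to stay within the cluster-wide pod threshold.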

Sources

| Source | URL | Retrieved Via |
|---|---|---|
| Official Website | https://kubernetes.io | Direct |
| Documentation | https://kubernetes.io/docs/ | Direct |
| GitHub Repository | https://github.com/kubernetes/kubernetes | Direct |
| Releases | https://kubernetes.io/releases/ | Web Search |
| Release Cycle | https://kubernetes.dev | Web Search |
| API Reference | https://kubernetes.io/docs/reference/kubernetes-api/ | Direct |
| CNCF | https://cncf.io | Direct |
| CNCF Landscape | https://landscape.cncf.io | Direct |
| Scalability Targets | https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md | Direct |
| Ingress NGINX Retirement | https://kubernetes.io/blog/ | Web Search |

Questions

Open Questions

Answered Questions

  • How does v1.36 SELinuxMount GA affect pod startup latency in enforcing environments? — SELinuxMount (KEP-1710, alpha in v1.28, GA in v1.30+) mounts volumes with the correct SELinux label via mount options (context=...) at mount time rather than recursively relabeling each file. In enforcing environments with large volumes (10K+ files), this reduces pod startup latency from minutes to seconds. The kernel applies the label at mount time, eliminating the per-file setxattr syscall overhead. Benchmarks show 10-100x improvement for volumes with many small files. No benchmark data specific to v1.36 yet, as SELinuxMount was already GA before v1.36. — resolved via KEP-1710 documentation

  • What is the production readiness of DRA (Dynamic Resource Allocation) for GPU partitioning in v1.36? — DRA with structured parameters (KEP-4381) was introduced as alpha in v1.30, moved to beta in v1.32, and its core API graduated to GA in v1.34, so the core API is stable in v1.36. Driver maturity is a separate question: the NVIDIA DRA driver (nvidia/k8s-dra-driver) supports MIG partitioning and time-slicing but is itself experimental. For production GPU workloads today, the traditional NVIDIA device plugin with MIG/time-slicing remains the recommended path until the driver stabilizes. — resolved via KEP-4381 documentation

  • How does the Gateway API adoption compare to the retired ingress-nginx in production environments? — Gateway API is the official successor to Ingress. ingress-nginx was retired March 2026. Multiple production-grade controllers exist: Envoy Gateway (CNCF), Istio Gateway, Cilium Gateway, Kong Gateway. Gateway API provides richer routing (traffic splitting, header matching, weighted backends) and role-oriented design (ClusterOperator, InfrastructureProvider, ApplicationDeveloper). Migration from Ingress is straightforward for basic use cases but requires rethinking for advanced patterns. — resolved via Kubernetes Gateway API documentation
  • What are the recommended etcd backup strategies for clusters with >10,000 objects? — (1) Take periodic snapshots with etcdctl snapshot save every 6-12 hours; (2) tune --snapshot-count (for example, 10000), which sets how many committed transactions trigger an on-disk snapshot; (3) run compaction (etcdctl compaction <revision>) so that key history does not grow without bound; (4) store snapshots in external object storage (S3/GCS); (5) for large clusters, run etcdctl defrag during maintenance windows to reclaim the space freed by compaction; (6) test restores regularly on a separate cluster. Items 2, 3, and 5 are maintenance that keeps the database and its snapshots small rather than backups in themselves. Etcd recommends keeping the last 2-3 snapshots. — resolved via etcd documentation

  • What is the max cluster size? → 5,000 nodes, 150,000 pods, 10,000 services (upstream thresholds). See infrastructure/kubernetes/index#Scale Limits (Upstream).

  • Is Docker still supported? → Dockershim was removed in v1.24. containerd and CRI-O are the supported runtimes.
  • What happened to ingress-nginx? → Retired March 24, 2026. Migrate to Gateway API controllers.
  • How does scheduling work? → Filter → Score → Bind. See infrastructure/kubernetes/architecture#How It Works.
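
The Filter → Score → Bind sequence from the last answer can be sketched as follows. This is a toy model, not kube-scheduler code: `Node`, `Pod`, and the three functions are hypothetical, CPU is the only resource considered, taints are plain strings, and the "most free CPU after placement" scoring rule is an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    taints: Set[str] = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_request_millis: int
    tolerations: Set[str] = field(default_factory=set)


def filter_nodes(pod: Pod, nodes: List[Node]) -> List[Node]:
    """Filter: keep nodes with enough CPU whose taints the pod all tolerates."""
    return [n for n in nodes
            if n.free_cpu_millis >= pod.cpu_request_millis
            and n.taints <= pod.tolerations]


def score_node(pod: Pod, node: Node) -> int:
    """Score: prefer the node with the most CPU left after placement."""
    return node.free_cpu_millis - pod.cpu_request_millis


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Run filter, score, and bind; return the chosen node name or None."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None  # no feasible node: the pod would stay Pending
    best = max(feasible, key=lambda n: score_node(pod, n))
    best.free_cpu_millis -= pod.cpu_request_millis  # bind: reserve the CPU
    return best.name
```

Filtering comes first so that scoring only ranks nodes that can actually run the pod. A tainted node is excluded unless the pod carries a matching toleration.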