Kubernetes — Operations

Scope

Production deployment patterns, cluster management, performance tuning, upgrade procedures, and troubleshooting for Kubernetes.

Cluster Architecture Patterns

Control Plane High Availability

Pattern | etcd Topology | API Servers | Min Nodes | Use Case
--- | --- | --- | --- | ---
Stacked | Co-located with control plane | 3+ | 3 | Most deployments
External | Dedicated etcd cluster | 3+ | 6 (3 etcd + 3 control plane) | Enterprise, large scale
Single | Single node | 1 | 1 | Dev/test only
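
For a stacked-etcd HA control plane bootstrapped with kubeadm, a minimal sketch looks like the following; the load-balancer endpoint lb.example.com:6443 is a placeholder, and the token, CA hash, and certificate key come from the kubeadm init output.

# First control-plane node: point the cluster at the load balancer and upload certs
kubeadm init --control-plane-endpoint "lb.example.com:6443" --upload-certs

# Additional control-plane nodes: join with the --control-plane flag
kubeadm join lb.example.com:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>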

Node Sizing Guidelines

Workload Type | vCPUs | Memory | Storage | Network
--- | --- | --- | --- | ---
General purpose | 4-8 | 16-32Gi | 100Gi SSD | 10Gbps
Memory-intensive (DB) | 8-16 | 64-128Gi | 500Gi NVMe | 10Gbps
GPU/ML | 8+ (plus GPU) | 64Gi+ | 1Ti NVMe | 25Gbps
Edge/IoT | 2 | 4Gi | 32Gi | 1Gbps
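
To compare these targets against what nodes actually offer the scheduler, one option is to list allocatable capacity per node (capacity minus system and kube reservations); the field paths below assume the standard Node status layout.

# Allocatable CPU and memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory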

Performance Tuning

API Server

# Increase request limits for large clusters
--max-requests-inflight=400          # default: 400
--max-mutating-requests-inflight=200 # default: 200
--default-watch-cache-size=100       # default; per-resource overrides via --watch-cache-sizes=<resource>#<size>
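
To check whether the in-flight limits are actually being approached, one option is to read the API server's own metrics; apiserver_current_inflight_requests is exposed on /metrics, though label names can vary by version.

# Current read-only and mutating in-flight requests, straight from the API server
kubectl get --raw /metrics | grep apiserver_current_inflight_requests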

etcd

Parameter | Small (< 100 nodes) | Large (100+ nodes)
--- | --- | ---
--quota-backend-bytes | 2Gi (default) | 8Gi
--snapshot-count | 10000 | 50000
--auto-compaction-retention | 1h | 5m
Storage | SSD (minimum) | NVMe (required)

etcd Performance

etcd is the #1 bottleneck in large clusters. Always use dedicated NVMe storage with < 10ms fsync latency. Run etcdctl check perf regularly.
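
A quick way to run that check against a kubeadm-provisioned etcd, reusing the same client certificates as the backup example later in this page (paths assume kubeadm defaults):

# Benchmark etcd and report whether it passes the built-in performance thresholds
ETCDCTL_API=3 etcdctl check perf \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key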

Kubelet

# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110                # default, increase for dense nodes
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
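
After editing the file, the kubelet must be restarted to pick up the changes; the running configuration can then be read back through the node's configz debug endpoint (the node name is a placeholder, and jq is only used here for readability).

# Reload the kubelet and confirm the new setting took effect
systemctl restart kubelet
kubectl get --raw "/api/v1/nodes/<node>/proxy/configz" | jq '.kubeletconfig.maxPods'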

Upgrade Procedures

Cluster Version Upgrade

Version Skew Policy

Kubelets may be up to three minor versions older than the API server (two minor versions on clusters older than v1.28), and no component may be newer than the API server. Always upgrade the control plane first, then the nodes.

# 1. Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 2. Upgrade kubeadm, apply the node upgrade, then upgrade and restart the kubelet
apt-get update && apt-get install -y kubeadm=1.31.x-*
kubeadm upgrade node
apt-get install -y kubelet=1.31.x-*
systemctl daemon-reload && systemctl restart kubelet

# 3. Uncordon node
kubectl uncordon <node>
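
The worker steps above assume the control plane is already on the target version; a sketch of that part with kubeadm, pinning the version the same way as above, looks roughly like this.

# On the first control-plane node
apt-get update && apt-get install -y kubeadm=1.31.x-*
kubeadm upgrade plan                 # lists available target versions
kubeadm upgrade apply v1.31.x        # upgrades the control-plane components

# On the remaining control-plane nodes
kubeadm upgrade node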

Rolling Upgrade Strategy

  1. Upgrade control plane nodes one at a time
  2. Upgrade worker nodes in batches (10-20% at a time)
  3. Validate workload health between batches (see the check after this list)
  4. Keep old node pool available for rollback
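
One hedged way to do the validation step between batches, plus a PodDisruptionBudget so node drains cannot take out too many replicas at once; the app label and PDB name are placeholders.

# Any nodes not Ready, or pods that are neither Running nor Succeeded?
kubectl get nodes --no-headers | awk '$2 != "Ready"'
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Protect critical workloads during drains
kubectl create poddisruptionbudget app-pdb --selector=app=my-app --min-available=80%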

Common Issues & Troubleshooting

Symptom | Diagnosis | Resolution
--- | --- | ---
Node NotReady | kubectl describe node | Check kubelet logs, disk pressure, memory
Pod stuck Pending | kubectl describe pod | Check resource requests, node affinity, PVC binding
DNS resolution failing | kubectl exec -it <pod> -- nslookup kubernetes | Restart CoreDNS, check resolv.conf
CrashLoopBackOff | kubectl logs <pod> --previous | Fix application error, check probes
ImagePullBackOff | Check image name/registry | Verify image exists, check pull secrets
etcd leader changes | etcdctl endpoint status | Check disk latency, network partitions
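
A generic first pass that covers most of these symptoms, assuming nothing beyond kubectl access; pod and namespace names are placeholders.

# Recent cluster events, newest last, usually point at the failing component
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Drill into the affected object, then its logs (previous container if it crashed)
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous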

Monitoring Stack

Essential Metrics

# Cluster capacity utilization
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_capacity{resource="cpu"})

# Node memory pressure
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1

# API server request latency (p99 per verb, excluding watches)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le, verb))

# etcd WAL fsync duration (p99, should be < 10ms)
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))

# Pod restart rate
rate(kube_pod_container_status_restarts_total[1h]) > 0

Backup & Disaster Recovery

etcd Backup

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-*.db --write-out=table
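
Restoring is the mirror image; a minimal sketch, with the snapshot filename and a fresh data directory as placeholders (on etcd 3.5+ the same subcommand is also available as etcdutl snapshot restore).

# Restore the snapshot into a new data directory, then point etcd at it
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-<date>.db \
  --data-dir=/var/lib/etcd-restore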

Velero for Workload Backup

velero backup create full-backup --include-namespaces '*'
velero restore create --from-backup full-backup
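
One-off backups like the one above can also be put on a schedule; the schedule name and cron expression below are examples.

# Nightly full backup at 02:00 with the same namespace scope
velero schedule create nightly-full --schedule="0 2 * * *" --include-namespaces '*'
velero backup get        # list backups and their status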