Kubernetes — Operations¶
Scope
Production deployment patterns, cluster management, performance tuning, upgrade procedures, and troubleshooting for Kubernetes.
Cluster Architecture Patterns¶
Control Plane High Availability¶
| Pattern | etcd Topology | API Server | Min Nodes | Use Case |
|---|---|---|---|---|
| Stacked | Co-located with control plane | 3+ | 3 | Most deployments |
| External | Dedicated etcd cluster | 3+ | 6 (3 etcd + 3 CP) | Enterprise, large scale |
| Single | Single node | 1 | 1 | Dev/test only |
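The stacked topology above can be bootstrapped with kubeadm. A minimal sketch, assuming a load balancer already fronts the API servers; `lb.example.com`, the token, hash, and certificate key are placeholders:

```shell
# Sketch: stacked-etcd HA control plane with kubeadm (endpoint is a placeholder).
kubeadm init --control-plane-endpoint "lb.example.com:6443" --upload-certs

# On each additional control-plane node, join using the values printed by init:
kubeadm join lb.example.com:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>
```

`--upload-certs` shares the control-plane certificates via a cluster secret so the joining nodes do not need them copied manually.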
Node Sizing Guidelines¶
| Workload Type | vCPUs | Memory | Storage | Network |
|---|---|---|---|---|
| General purpose | 4-8 | 16-32Gi | 100Gi SSD | 10Gbps |
| Memory-intensive (DB) | 8-16 | 64-128Gi | 500Gi NVMe | 10Gbps |
| GPU/ML | 8+ + GPU | 64Gi+ | 1Ti NVMe | 25Gbps |
| Edge/IoT | 2 | 4Gi | 32Gi | 1Gbps |
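To check how existing nodes compare against this table, allocatable capacity can be listed directly; a sketch using standard node status fields:

```shell
# Sketch: list allocatable CPU, memory, and pod capacity per node.
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU:.status.allocatable.cpu,\
MEM:.status.allocatable.memory,\
PODS:.status.allocatable.pods
```

Allocatable is capacity minus system and kubelet reservations, so it is the number to compare against workload requests.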
Performance Tuning¶
API Server¶
# Increase request limits for large clusters
--max-requests-inflight=400 # default: 400
--max-mutating-requests-inflight=200 # default: 200
--watch-cache-sizes=pods#5000         # per-resource overrides, format: resource[.group]#size
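Before raising these limits, it helps to check how close the defaults are to saturation. A sketch, assuming cluster-admin access to the API server's `/metrics` endpoint:

```shell
# Sketch: inspect current in-flight request usage against the configured limits.
kubectl get --raw /metrics | grep apiserver_current_inflight_requests
```

If the mutating/read-only in-flight counts sit near the limits during normal load, raise the flags; otherwise the defaults are usually fine.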
etcd¶
| Parameter | Small (< 100 nodes) | Large (100+ nodes) |
|---|---|---|
| `--quota-backend-bytes` | 2Gi (default) | 8Gi |
| `--snapshot-count` | 10000 | 50000 |
| `--auto-compaction-retention` | 1h | 5m |
| Storage | SSD (min) | NVMe (required) |
etcd Performance
etcd is the #1 bottleneck in large clusters. Always use dedicated NVMe storage with < 10ms fsync latency. Run `etcdctl check perf` regularly.
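The `check perf` run needs the same client certificates as any other etcdctl call; a sketch against a local member, using the standard kubeadm certificate paths:

```shell
# Sketch: run etcd's built-in performance check against the local member.
ETCDCTL_API=3 etcdctl check perf \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Reports PASS/FAIL on throughput and latency against etcd's built-in thresholds.
```

Run it off-peak: the check writes real load into the cluster while it measures.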
Kubelet¶
# /var/lib/kubelet/config.yaml
maxPods: 110 # default, increase for dense nodes
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
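To confirm a running kubelet actually picked up this file, its effective configuration can be read back through the API server's node proxy; `<node>` is a placeholder:

```shell
# Sketch: dump the kubelet's live config via the read-only configz endpoint.
kubectl get --raw "/api/v1/nodes/<node>/proxy/configz" | python3 -m json.tool
```

Compare the `maxPods` and `evictionHard` values in the output against the file, since a typo in the YAML can silently leave defaults in place.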
Upgrade Procedures¶
Cluster Version Upgrade¶
Version Skew Policy
Kubelets may run up to three minor versions behind the API server (N-3 since v1.28; N-2 before that). Always upgrade the control plane first, then the nodes.
# 1. Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# 2. Upgrade kubeadm, apply the node upgrade, then upgrade kubelet
apt-get update && apt-get install -y kubeadm=1.31.x-*
kubeadm upgrade node
apt-get install -y kubelet=1.31.x-*
systemctl daemon-reload && systemctl restart kubelet
# 3. Uncordon node
kubectl uncordon <node>
Rolling Upgrade Strategy¶
- Upgrade control plane nodes one at a time
- Upgrade worker nodes in batches (10-20% at a time)
- Validate workload health between batches
- Keep old node pool available for rollback
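The 10-20% batch guidance above can be turned into a concrete number with ceiling division, so even small clusters get a batch of at least one node. `TOTAL_WORKERS` here is a stand-in for the live worker count:

```shell
# Sketch: derive a worker-upgrade batch size from the 10-20% guidance.
TOTAL_WORKERS=40   # stand-in; live value: kubectl get nodes --no-headers | wc -l
BATCH_PCT=15       # pick a point in the 10-20% range
# Ceiling division: round up so the batch is never zero.
BATCH_SIZE=$(( (TOTAL_WORKERS * BATCH_PCT + 99) / 100 ))
echo "upgrade in batches of ${BATCH_SIZE} nodes"
```

With 40 workers at 15% this yields batches of 6 nodes, i.e. 7 batches to cover the fleet.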
Common Issues & Troubleshooting¶
| Symptom | Diagnosis | Resolution |
|---|---|---|
| Node `NotReady` | `kubectl describe node` | Check kubelet logs, disk pressure, memory |
| Pod stuck `Pending` | `kubectl describe pod` | Check resource requests, node affinity, PVC |
| DNS resolution failing | `kubectl exec -it <pod> -- nslookup kubernetes` | Restart CoreDNS, check resolv.conf |
| `CrashLoopBackOff` | `kubectl logs --previous` | Fix application error, check probes |
| `ImagePullBackOff` | Check image name/registry | Verify image exists, check pull secrets |
| etcd leader changes | `etcdctl endpoint status` | Check disk latency, network partitions |
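A few standing one-liners surface most of these symptoms cluster-wide before drilling into individual pods:

```shell
# Sketch: cluster-wide triage for the symptoms in the table.
kubectl get pods -A --field-selector=status.phase=Pending        # stuck Pending
kubectl get pods -A | grep -E 'CrashLoopBackOff|ImagePullBackOff' # restart/pull failures
kubectl get events -A --sort-by=.lastTimestamp | tail -20        # most recent cluster events
```

Events are the fastest way to see *why* a scheduler or kubelet rejected something; they expire (default 1h), so capture them early.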
Monitoring Stack¶
Essential Metrics¶
# Cluster capacity utilization
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_capacity{resource="cpu"})
# Node memory pressure
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
# API server request latency, p99 (rate over buckets; excludes long-lived watches)
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le, verb))
# etcd WAL fsync duration, p99 (should be < 10ms)
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))
# Pod restart rate
rate(kube_pod_container_status_restarts_total[1h]) > 0
Backup & Disaster Recovery¶
etcd Backup¶
# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-*.db --write-out=table
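A snapshot is only half of disaster recovery; a restore sketch, with the snapshot name and data directory as placeholders (on etcd >= 3.5 the `etcdutl` binary is the preferred home for `snapshot restore`):

```shell
# Sketch: restore the snapshot into a fresh data directory (paths are placeholders).
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-<date>.db \
  --data-dir=/var/lib/etcd-restored
# Then point the etcd static pod (or systemd unit) at the new data dir and restart.
```

Restores rebuild the member from scratch, so in an HA cluster every member must be restored from the same snapshot before the cluster is brought back up.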