Operations¶
Scope
Production deployment patterns, cluster management, performance tuning, upgrade procedures, and troubleshooting for Kubernetes.
Cluster Architecture Patterns¶
Control Plane High Availability¶
| Pattern | etcd Topology | API Server | Min Nodes | Use Case |
|---|---|---|---|---|
| Stacked | Co-located with control plane | 3+ | 3 | Most deployments |
| External | Dedicated etcd cluster | 3+ | 6 (3 etcd + 3 CP) | Enterprise, large scale |
| Single | Single node | 1 | 1 | Dev/test only |
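The "3+" member counts in the table follow from etcd's majority quorum: a cluster of n members needs floor(n/2) + 1 of them to commit writes. A minimal sketch of that arithmetic (the function names are illustrative, not part of any etcd tooling):

```shell
# Quorum and failure tolerance for an etcd cluster of a given size.
# etcd needs a majority (floor(n/2) + 1) of members to commit writes.

etcd_quorum() {
    # $1: number of etcd members
    echo $(( $1 / 2 + 1 ))
}

etcd_fault_tolerance() {
    # Members that can fail while the remaining ones still form a quorum.
    echo $(( $1 - ($1 / 2 + 1) ))
}

for members in 1 3 4 5; do
    echo "members=$members quorum=$(etcd_quorum "$members") tolerates=$(etcd_fault_tolerance "$members")"
done
```

Four members tolerate the same single failure as three, which is why odd cluster sizes are the usual choice.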
Node Sizing Guidelines¶
| Workload Type | vCPUs | Memory | Storage | Network |
|---|---|---|---|---|
| General purpose | 4-8 | 16-32Gi | 100Gi SSD | 10Gbps |
| Memory-intensive (DB) | 8-16 | 64-128Gi | 500Gi NVMe | 10Gbps |
| GPU/ML | 8+ + GPU | 64Gi+ | 1Ti NVMe | 25Gbps |
| Edge/IoT | 2 | 4Gi | 32Gi | 1Gbps |
Performance Tuning¶
API Server¶
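# Request limits (values shown are the defaults; raise them for large clusters)
--max-requests-inflight=400 # default: 400
--max-mutating-requests-inflight=200 # default: 200
# Watch cache is set per resource as resource[.group]#size, comma-separated.
# Recent releases size the cache dynamically; only 0 (disable) changes behavior.
--watch-cache-sizes=pods#100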
etcd¶
| Parameter | Small (< 100 nodes) | Large (100+ nodes) |
|---|---|---|
| --quota-backend-bytes | 2Gi (default) | 8Gi |
| --snapshot-count | 10000 | 50000 |
| --auto-compaction-retention | 1h | 5m |
| Storage | SSD (min) | NVMe (required) |
etcd Performance
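etcd is frequently the main bottleneck in large clusters. Use dedicated NVMe storage and keep 99th-percentile WAL fsync latency under 10ms. Benchmark with etcdctl check perf, preferably off-peak, because the check generates write load on the cluster.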
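The quota values in the table are given in GiB, while --quota-backend-bytes takes a plain byte count. A small sketch of the conversion (gib_to_bytes is a hypothetical helper, not an etcd command):

```shell
# Convert a GiB quota into the byte count that etcd's
# --quota-backend-bytes flag expects.
gib_to_bytes() {
    # $1: size in GiB (integer)
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

echo "--quota-backend-bytes=$(gib_to_bytes 8)"
```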
Kubelet¶
# /var/lib/kubelet/config.yaml
maxPods: 110 # default, increase for dense nodes
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
Upgrade Procedures¶
Cluster Version Upgrade¶
Version Skew Policy
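Since Kubernetes 1.28, kubelets may be up to three minor versions older than kube-apiserver (N-3); earlier releases supported N-2. A kubelet must never be newer than kube-apiserver, so always upgrade the control plane first, then the nodes.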
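# 1. Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# 2. Upgrade kubeadm and apply the node upgrade (run on the node;
#    the first control plane node uses "kubeadm upgrade apply" instead)
apt-get update && apt-get install -y kubeadm=1.31.x-*
kubeadm upgrade node
# 3. Upgrade kubelet and restart it
apt-get install -y kubelet=1.31.x-*
systemctl daemon-reload && systemctl restart kubelet
# 4. Uncordon node
kubectl uncordon <node>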
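Before upgrading the control plane, it can help to confirm that every kubelet will still fall inside the supported skew afterwards. The sketch below is one way to express that check; skew_ok is a hypothetical helper, and the default limit of 3 is an assumption that holds for kube-apiserver 1.28 and later (pass 2 as the third argument for older control planes).

```shell
# Check whether a kubelet minor version is within the supported skew of the
# control plane. The limit defaults to 3 (assumes kube-apiserver 1.28+).
skew_ok() {
    cp_minor=$1        # control plane minor version, e.g. 31 for v1.31
    kubelet_minor=$2   # kubelet minor version, e.g. 29 for v1.29
    max_skew=${3:-3}
    diff=$(( cp_minor - kubelet_minor ))
    # The kubelet must not be newer than the API server, nor too far behind.
    [ "$diff" -ge 0 ] && [ "$diff" -le "$max_skew" ]
}

if skew_ok 31 29; then
    echo "v1.29 kubelet is supported by a v1.31 control plane"
fi
if ! skew_ok 31 27; then
    echo "v1.27 kubelet is too old for a v1.31 control plane"
fi
```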
Rolling Upgrade Strategy¶
- Upgrade control plane nodes one at a time
- Upgrade worker nodes in batches (10-20% at a time)
- Validate workload health between batches
- Keep old node pool available for rollback
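The batch guideline above can be turned into a concrete plan. The following sketch splits a list of node names into batches of a given percentage, rounding up so that small fleets still get a batch of at least one node. batch_size and print_batches are illustrative names, and the node names in the example are made up.

```shell
# Plan worker upgrades in batches covering a percentage of the fleet.
batch_size() {
    # $1: total nodes, $2: percent per batch; rounds up, minimum 1
    size=$(( ($1 * $2 + 99) / 100 ))
    if [ "$size" -lt 1 ]; then size=1; fi
    echo "$size"
}

print_batches() {
    # Reads one node name per line on stdin; $1: percent per batch.
    nodes=$(cat)
    total=$(printf '%s\n' "$nodes" | awk 'NF { c++ } END { print c + 0 }')
    size=$(batch_size "$total" "$1")
    printf '%s\n' "$nodes" | awk -v size="$size" '
        NF {
            line = (line == "" ? $1 : line " " $1)
            if (++n % size == 0) { print "batch " (n / size) ": " line; line = "" }
        }
        END { if (line != "") print "batch " (int(n / size) + 1) ": " line }'
}

# Example with ten hypothetical nodes, 20% per batch (two nodes each)
printf 'node-%s\n' 1 2 3 4 5 6 7 8 9 10 | print_batches 20
```

On a live cluster the input could come from `kubectl get nodes -o name`; each printed batch would then be drained, upgraded, and validated before moving on to the next.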
Common Issues & Troubleshooting¶
| Symptom | Diagnosis | Resolution |
|---|---|---|
| Node NotReady | kubectl describe node <node> | Check kubelet logs, disk pressure, memory |
| Pod stuck Pending | kubectl describe pod <pod> | Check resource requests, node affinity, PVC |
| DNS resolution failing | kubectl exec -it <pod> -- nslookup kubernetes.default | Restart CoreDNS, check resolv.conf |
| CrashLoopBackOff | kubectl logs <pod> --previous | Fix application error, check probes |
| ImagePullBackOff | Check image name/registry | Verify image exists, check pull secrets |
| etcd leader changes | etcdctl endpoint status | Check disk latency, network partitions |
Monitoring Stack¶
Essential Metrics¶
# Cluster capacity utilization
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_capacity{resource="cpu"})
# Node memory pressure
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
# API server latency (p99 over 5m, per verb)
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])))
# etcd WAL sync duration (p99, should be < 10ms)
histogram_quantile(0.99, sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))
# Pod restart rate
rate(kube_pod_container_status_restarts_total[1h]) > 0
Backup & Disaster Recovery¶
etcd Backup¶
# Snapshot etcd
SNAPSHOT=/backup/etcd-$(date +%Y%m%d).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Verify snapshot (pass a single file; a glob may match several)
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" --write-out=table
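Daily snapshots accumulate, so a retention step is usually paired with the backup. The sketch below keeps only the newest N files; it relies on the etcd-YYYYMMDD.db naming used by the snapshot command, where lexical order matches date order. prune_snapshots is a hypothetical helper, and the retention count is an assumption to adjust to your recovery requirements.

```shell
# Keep only the newest N etcd snapshots in a directory.
# Assumes files are named etcd-YYYYMMDD.db, so glob order is date order.
prune_snapshots() {
    dir=$1    # snapshot directory, e.g. /backup
    keep=$2   # number of most recent snapshots to retain
    set -- "$dir"/etcd-*.db
    [ -e "$1" ] || return 0           # glob matched nothing
    while [ "$#" -gt "$keep" ]; do
        rm -f -- "$1"                 # oldest snapshot sorts first
        shift
    done
}

# Example (not executed here): retain one week of daily snapshots
# prune_snapshots /backup 7
```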
Velero for Workload Backup¶
velero backup create full-backup --include-namespaces '*'
velero restore create --from-backup full-backup
Commands & Recipes¶
Cluster Operations¶
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes
# Drain node for maintenance
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
kubectl uncordon node-1
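# Check component health (/readyz and /livez supersede the deprecated /healthz)
kubectl get --raw '/readyz?verbose'
kubectl get --raw '/livez?verbose'
kubectl get componentstatuses # deprecated since v1.19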
Workload Management¶
# Deploy and scale
kubectl apply -f deployment.yaml
kubectl scale deployment myapp --replicas=5
kubectl rollout status deployment myapp
# Rolling update
kubectl set image deployment/myapp app=myapp:v2.0
kubectl rollout undo deployment/myapp # rollback
# Restart pods (rolling)
kubectl rollout restart deployment/myapp
# Port forward for debugging
kubectl port-forward svc/myapp 8080:80
# Run one-off debug pod
kubectl run debug --rm -it --image=nicolaka/netshoot -- /bin/bash
Debugging¶
# Pod debugging
kubectl describe pod myapp-xxx # events, conditions
kubectl logs myapp-xxx -c app --previous # previous crash logs
kubectl logs -l app=myapp --all-containers -f # follow all pods
# Exec into running pod
kubectl exec -it myapp-xxx -- /bin/sh
# Debug node
kubectl debug node/node-1 -it --image=ubuntu
# Check events (sorted by time)
kubectl get events --sort-by='.lastTimestamp' -A
# Resource usage
kubectl top pods --sort-by=cpu -A
kubectl top pods --sort-by=memory
Networking¶
# View services and endpoints
kubectl get svc -o wide
kubectl get endpoints myapp
# DNS debugging
kubectl run dns-test --rm -it --image=busybox:1.36 -- nslookup myapp.default.svc.cluster.local
# View network policies
kubectl get networkpolicies -A
RBAC¶
# Check permissions
kubectl auth can-i create pods --namespace=production
kubectl auth can-i '*' '*' --all-namespaces # am I cluster-admin?
# Create service account with role
kubectl create serviceaccount deployer
kubectl create rolebinding deployer-binding \
  --clusterrole=edit \
  --serviceaccount=default:deployer
Helm¶
# Install chart
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mydb bitnami/postgresql --set auth.postgresPassword=secret
# Upgrade with values
helm upgrade mydb bitnami/postgresql -f values.yaml
# Rollback
helm rollback mydb 1
# Template (dry-run render)
helm template mydb bitnami/postgresql -f values.yaml > rendered.yaml
Manifest Patterns¶
# Deployment with resource limits, probes, and topology spread
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp
      containers:
        - name: app
          image: myapp:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 5
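To see what maxSkew: 1 implies for the manifest above, the sketch below prints the most even per-zone replica counts for a given replica and zone count. It only illustrates the arithmetic: the zone names are placeholders, all zones are assumed eligible, and the scheduler decides the actual placement.

```shell
# Print a balanced per-zone replica distribution, i.e. one in which zone
# counts differ by at most 1 (the outcome that maxSkew: 1 permits).
zone_spread() {
    replicas=$1
    zones=$2
    base=$(( replicas / zones ))    # every zone gets at least this many
    extra=$(( replicas % zones ))   # this many zones get one more
    i=1
    while [ "$i" -le "$zones" ]; do
        if [ "$i" -le "$extra" ]; then count=$(( base + 1 )); else count=$base; fi
        echo "zone-$i: $count"
        i=$(( i + 1 ))
    done
}

zone_spread 3 3   # the manifest's replica count across three zones
zone_spread 5 3   # after scaling to five replicas
```

With whenUnsatisfiable: DoNotSchedule, a pod whose placement would push the difference between zones above 1 stays Pending instead of being scheduled.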