Operations

Scope

Production deployment patterns, cluster management, performance tuning, upgrade procedures, and troubleshooting for Kubernetes.

Cluster Architecture Patterns

Control Plane High Availability

Pattern	etcd Topology	API Server	Min Nodes	Use Case
Stacked	Co-located with control plane	3+	3	Most deployments
External	Dedicated etcd cluster	3+	6 (3 etcd + 3 CP)	Enterprise, large scale
Single	Single node	1	1	Dev/test only
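
The node counts in the table follow from etcd's majority quorum: a cluster of n members needs floor(n/2) + 1 healthy members to accept writes. A minimal sketch of that arithmetic, where `quorum` and `fault_tolerance` are illustrative helper names and not part of etcd or kubectl:

```shell
# Quorum arithmetic for an etcd cluster of N members.
# Both functions are hypothetical helpers that only do integer math.

# Members that must be healthy for the cluster to accept writes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail before quorum is lost.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Four members tolerate one failure, the same as three, which is why odd member counts (3 or 5) are the usual choice.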

Node Sizing Guidelines

Workload Type	vCPUs	Memory	Storage	Network
General purpose	4-8	16-32Gi	100Gi SSD	10Gbps
Memory-intensive (DB)	8-16	64-128Gi	500Gi NVMe	10Gbps
GPU/ML	8+ plus GPU	64Gi+	1Ti NVMe	25Gbps
Edge/IoT	2	4Gi	32Gi	1Gbps

Performance Tuning

API Server

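# In-flight request limits; the values shown are the defaults, raise them for large clusters
--max-requests-inflight=400          # default: 400
--max-mutating-requests-inflight=200 # default: 200
# Format is resource[.group]#size; recent releases size the cache dynamically and only honor 0 (disable)
--watch-cache-sizes=pods#1000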

etcd

Parameter	Small (< 100 nodes)	Large (100+ nodes)
`--quota-backend-bytes`	2Gi (default)	8Gi
`--snapshot-count`	10000	50000
`--auto-compaction-retention`	1h	5m
Storage	SSD (min)	NVMe (required)

etcd Performance

etcd is usually the first bottleneck in large clusters. Use dedicated NVMe storage and keep 99th-percentile WAL fsync latency below 10 ms. Run `etcdctl check perf` regularly, ideally off-peak, since it puts real load on the cluster.

Kubelet

# /var/lib/kubelet/config.yaml
maxPods: 110                # default, increase for dense nodes
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
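
To make the percentage thresholds above concrete, the sketch below converts them into absolute sizes for a given disk. It assumes the node filesystem and the image filesystem are each `disk_gi` in size (on many nodes they are the same filesystem). `disk_thresholds` is a hypothetical helper that only does arithmetic and does not read the kubelet configuration.

```shell
# disk_thresholds DISK_SIZE_GI
# Translate the percentage settings from the kubelet config above into Gi.
# Hypothetical helper; percentages are copied from the config, not read from it.
disk_thresholds() {
  disk_gi=$1
  echo "image GC starts above $(( disk_gi * 85 / 100 ))Gi used"
  echo "image GC frees down to $(( disk_gi * 80 / 100 ))Gi used"
  echo "nodefs hard eviction below $(( disk_gi * 10 / 100 ))Gi free"
  echo "imagefs hard eviction below $(( disk_gi * 15 / 100 ))Gi free"
}

# Example: the 100Gi general-purpose node from the sizing table.
disk_thresholds 100
```

With these values, the imagefs eviction threshold (15% free) and the image GC high threshold (85% used) describe the same fill level.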

Upgrade Procedures

Cluster Version Upgrade

Version Skew Policy

The kubelet may be up to three minor versions older than kube-apiserver (two for releases before 1.28) and must never be newer. Always upgrade the control plane first, then the nodes.
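
A pre-upgrade check of the skew rule could look like the sketch below. The allowed skew is passed in as a parameter rather than hard-coded, since it differs between releases. `minor_of` and `skew_ok` are hypothetical helpers, and the version strings are made-up examples.

```shell
# Extract the minor version from strings such as "v1.31.2" or "1.31".
# Assumes major version 1 and ignores it.
minor_of() {
  echo "$1" | sed 's/^v//' | cut -d. -f2
}

# skew_ok CONTROL_PLANE_VERSION KUBELET_VERSION MAX_SKEW
# Succeeds when the kubelet is not newer than the control plane and is at
# most MAX_SKEW minor versions older. Not a kubectl feature.
skew_ok() {
  cp_minor=$(minor_of "$1")
  kubelet_minor=$(minor_of "$2")
  diff=$(( cp_minor - kubelet_minor ))
  [ "$diff" -ge 0 ] && [ "$diff" -le "$3" ]
}

# MAX_SKEW of 3 applies to control planes at 1.28 or later.
if skew_ok v1.31.2 v1.29.6 3; then
  echo "v1.29.6 kubelet: within skew"
fi
if ! skew_ok v1.31.2 v1.27.9 3; then
  echo "v1.27.9 kubelet: too old, upgrade it first"
fi
```

In practice the kubelet versions would come from each node's `.status.nodeInfo.kubeletVersion` field, for example via `kubectl get nodes -o jsonpath`.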

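# 1. Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 2. On the node: upgrade kubeadm, apply the upgrade, then upgrade kubelet
apt-get update && apt-get install -y kubeadm=1.31.x-*
kubeadm upgrade node   # first control plane node: kubeadm upgrade apply v1.31.x
apt-get install -y kubelet=1.31.x-*
systemctl daemon-reload && systemctl restart kubelet

# 3. Uncordon node
kubectl uncordon <node>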

Rolling Upgrade Strategy

Upgrade control plane nodes one at a time
Upgrade worker nodes in batches (10-20% at a time)
Validate workload health between batches
Keep old node pool available for rollback
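
The batch sizing in the strategy above can be sketched as follows. This is only the grouping logic, under the assumption that a batch is a percentage of the worker count rounded up, with at least one node per batch. `batch_size` and `print_batches` are hypothetical helpers, and the node names are placeholders.

```shell
# batch_size TOTAL_NODES PERCENT
# Nodes per batch, rounded up, never less than 1.
batch_size() {
  size=$(( ($1 * $2 + 99) / 100 ))
  [ "$size" -lt 1 ] && size=1
  echo "$size"
}

# print_batches PERCENT NODE...
# Print one line of space-separated node names per batch.
print_batches() {
  percent=$1; shift
  size=$(batch_size "$#" "$percent")
  count=0; line=""
  for node in "$@"; do
    line="${line:+$line }$node"
    count=$(( count + 1 ))
    if [ "$count" -eq "$size" ]; then
      echo "$line"
      count=0; line=""
    fi
  done
  # Emit the final, possibly smaller, batch.
  [ -n "$line" ] && echo "$line"
  return 0
}

# Seven workers at 20% per batch gives batches of two, plus a remainder.
print_batches 20 node-1 node-2 node-3 node-4 node-5 node-6 node-7
```

Each output line is one batch to drain, upgrade, and uncordon, with workload health validated before moving to the next line.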

Common Issues & Troubleshooting

Symptom	Diagnosis	Resolution
Node `NotReady`	`kubectl describe node <node>`	Check kubelet logs, disk pressure, memory
Pod stuck `Pending`	`kubectl describe pod <pod>`	Check resource requests, node affinity, PVC
DNS resolution failing	`kubectl exec -it <pod> -- nslookup kubernetes.default`	Restart CoreDNS, check `resolv.conf`
CrashLoopBackOff	`kubectl logs <pod> --previous`	Fix application error, check probes
ImagePullBackOff	Check image name/registry	Verify image exists, check pull secrets
etcd leader changes	`etcdctl endpoint status`	Check disk latency, network partitions
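
The table can also be read as a lookup from symptom to first diagnostic command. The sketch below encodes part of it. `next_step` is a hypothetical helper. Using `kubectl describe pod` for image pull failures and falling back to the event list for unknown symptoms are assumptions of this sketch, not rows from the table.

```shell
# next_step SYMPTOM NAME
# Print the first diagnostic command to run for a symptom.
# NAME is a node or pod name. The function only prints; it runs nothing.
next_step() {
  case "$1" in
    NotReady)         echo "kubectl describe node $2" ;;
    Pending)          echo "kubectl describe pod $2" ;;
    CrashLoopBackOff) echo "kubectl logs $2 --previous" ;;
    ImagePullBackOff) echo "kubectl describe pod $2" ;;   # assumption: pull errors appear in pod events
    *)                echo "kubectl get events --sort-by=.lastTimestamp -A" ;;
  esac
}

next_step CrashLoopBackOff myapp-xxx
next_step NotReady node-1
```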

Monitoring Stack

Essential Metrics

# Cluster CPU commitment (requests as a fraction of node capacity)
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_capacity{resource="cpu"})

# Node memory pressure
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1

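# API server latency (p99 by verb, excluding long-running WATCH requests)
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])))

# etcd WAL sync duration (p99 per member, should be < 10ms)
histogram_quantile(0.99, sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))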

# Pod restart rate
rate(kube_pod_container_status_restarts_total[1h]) > 0
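
The 10 ms budget in the WAL fsync query is stated in milliseconds, while the metric is reported in seconds. A small sketch of the comparison, assuming the p99 value has already been fetched from Prometheus (`check_fsync` is a hypothetical helper, and the sample values are made up):

```shell
# check_fsync P99_SECONDS
# Classify a p99 WAL fsync latency, given in seconds, against the 10 ms budget.
# awk is used because POSIX shell arithmetic is integer-only.
check_fsync() {
  awk -v p99="$1" 'BEGIN {
    ms = p99 * 1000
    if (ms < 10) printf "ok: p99 fsync %.1f ms\n", ms
    else         printf "slow: p99 fsync %.1f ms (budget 10 ms)\n", ms
  }'
}

check_fsync 0.004
check_fsync 0.025
```

Fetching the value is left out to keep the sketch free of environment-specific URLs. It would typically come from the Prometheus HTTP API or a dashboard.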

Backup & Disaster Recovery

etcd Backup

# Snapshot etcd
SNAPSHOT=/backup/etcd-$(date +%Y%m%d).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify snapshot (one file at a time; etcd 3.5+ prefers `etcdutl snapshot status`)
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" --write-out=table

Velero for Workload Backup

velero backup create full-backup --include-namespaces '*'
velero restore create --from-backup full-backup

Commands & Recipes

Cluster Operations

# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes

# Drain node for maintenance
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
kubectl uncordon node-1

# Check component health (componentstatuses is deprecated since v1.19)
kubectl get --raw '/readyz?verbose'
kubectl get --raw '/livez?verbose'

Workload Management

# Deploy and scale
kubectl apply -f deployment.yaml
kubectl scale deployment myapp --replicas=5
kubectl rollout status deployment myapp

# Rolling update
kubectl set image deployment/myapp app=myapp:v2.0
kubectl rollout undo deployment/myapp  # rollback

# Restart pods (rolling)
kubectl rollout restart deployment/myapp

# Port forward for debugging
kubectl port-forward svc/myapp 8080:80

# Run one-off debug pod
kubectl run debug --rm -it --image=nicolaka/netshoot -- /bin/bash

Debugging

# Pod debugging
kubectl describe pod myapp-xxx   # events, conditions
kubectl logs myapp-xxx -c app --previous  # previous crash logs
kubectl logs -l app=myapp --all-containers -f  # follow all pods

# Exec into running pod
kubectl exec -it myapp-xxx -- /bin/sh

# Debug node
kubectl debug node/node-1 -it --image=ubuntu

# Check events (sorted by time)
kubectl get events --sort-by='.lastTimestamp' -A

# Resource usage
kubectl top pods --sort-by=cpu -A
kubectl top pods --sort-by=memory

Networking

# View services and endpoints
kubectl get svc -o wide
kubectl get endpoints myapp

# DNS debugging
kubectl run dns-test --rm -it --image=busybox:1.36 -- nslookup myapp.default.svc.cluster.local

# View network policies
kubectl get networkpolicies -A

RBAC

# Check permissions
kubectl auth can-i create pods --namespace=production
kubectl auth can-i '*' '*' --all-namespaces  # am I cluster-admin?

# Create service account with role
kubectl create serviceaccount deployer
kubectl create rolebinding deployer-binding \
  --clusterrole=edit \
  --serviceaccount=default:deployer

Helm

# Install chart
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mydb bitnami/postgresql --set auth.postgresPassword=secret

# Upgrade with values
helm upgrade mydb bitnami/postgresql -f values.yaml

# Rollback
helm rollback mydb 1

# Template (dry-run render)
helm template mydb bitnami/postgresql -f values.yaml > rendered.yaml

Manifest Patterns

# Deployment with resource limits, probes, and topology spread
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp
      containers:
        - name: app
          image: myapp:v2.0   # pin a version; avoid :latest in production
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 5