
Operations

Production deployment of Redpanda — sizing, tuning, troubleshooting, and a Commands & Recipes section using rpk and the Helm chart.

Deployment Patterns

Single-binary install

For development or small production:

curl -1sLf 'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' | sudo bash
sudo apt install redpanda
sudo systemctl enable --now redpanda

rpk tunes the OS (CPU governor, swap, I/O scheduler) via rpk redpanda tune all; the package installs a tuner service that applies these optimizations at startup.

Three-node production cluster

Three brokers across three AZs with local NVMe storage and S3 for tiered storage:

# /etc/redpanda/redpanda.yaml
redpanda:
  data_directory: /var/lib/redpanda/data
  node_id: 0
  seed_servers:
    - host: { address: redpanda-0, port: 33145 }
    - host: { address: redpanda-1, port: 33145 }
    - host: { address: redpanda-2, port: 33145 }
  rpc_server:
    address: 0.0.0.0
    port: 33145
  kafka_api:
    - address: 0.0.0.0
      port: 9092
  admin:
    - address: 0.0.0.0
      port: 9644
  cloud_storage_enabled: true
  cloud_storage_bucket: redpanda-tiered-prod
  cloud_storage_region: us-east-1
  cloud_storage_credentials_source: aws_instance_metadata

Kubernetes (Operator + Helm)

helm repo add redpanda https://charts.redpanda.com
helm install redpanda redpanda/redpanda \
  --namespace redpanda --create-namespace \
  --set tls.enabled=true \
  --set listeners.kafka.tls.enabled=true \
  --set storage.persistentVolume.size=200Gi \
  --set storage.persistentVolume.storageClass=ssd \
  --set resources.cpu.cores=4 \
  --set statefulset.replicas=3

The Helm chart supports both the simple StatefulSet path and the Redpanda Operator for declarative cluster, topic, and user CRDs.
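With the Operator enabled, topics can be managed declaratively. A sketch of a Topic resource (apiVersion and field names assumed from the v1alpha2 CRD; verify with kubectl explain topic.spec against your installed chart version):

```yaml
# Sketch of an Operator-managed topic; field names are assumptions,
# check the CRD reference for your chart version.
apiVersion: cluster.redpanda.com/v1alpha2
kind: Topic
metadata:
  name: orders
  namespace: redpanda
spec:
  partitions: 12
  replicationFactor: 3
  additionalConfig:
    retention.ms: "604800000"
    cleanup.policy: "delete"
```

The advantage over rpk topic create is that the topic definition lives in version control and reconciles like any other Kubernetes resource.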

Sizing

Resource guidance:

  • CPU: cores map 1:1 to Seastar shards. Start with 4–8 cores per node.
  • Memory: 2 GB per core minimum; 4 GB+ per core for hot working sets.
  • Disk: local NVMe for data_directory; size to retain the hot tier (working set).
  • Object storage: S3 / GCS / Azure ADLS for tiered storage; size to full retention.
  • Network: 10 GbE+ between brokers; lower-latency NICs improve tail latency.
  • Hugepages: optional, but they help latency at high throughput.
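The guidance above reduces to back-of-envelope arithmetic. A sketch with illustrative inputs (the 1,000-partitions-per-shard cap is an assumed topic_partitions_per_shard value; substitute your own numbers):

```shell
# Back-of-envelope cluster capacity from the sizing guidance above.
brokers=3
cores_per_broker=8            # one Seastar shard per core
mem_gb_per_core=4             # "4 GB+ for hot working sets"
partitions_per_shard=1000     # topic_partitions_per_shard cap (assumed)

mem_per_broker_gb=$((cores_per_broker * mem_gb_per_core))
total_shards=$((brokers * cores_per_broker))
max_partitions=$((total_shards * partitions_per_shard))

echo "memory per broker: ${mem_per_broker_gb} GB"
echo "total shards: ${total_shards}"
echo "partition capacity (upper bound): ${max_partitions}"
```

The point of the exercise: partition capacity is a function of total shards, so adding cores (not just brokers) raises the ceiling.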

Best Practices

  • Run rpk redpanda mode production, then rpk redpanda tune all, to apply all OS optimizations.
  • One core = one shard = stable partition assignment; resizing CPU requires a careful re-sharding plan.
  • Disable swap entirely; Seastar pre-allocates memory.
  • Use rpk cluster config edit rather than hand-editing YAML — it validates syntactically.
  • Set the topic properties redpanda.remote.write=true and redpanda.remote.read=true for any topic where you want long retention without local disk cost (cloud_storage_enable_remote_write sets the cluster-wide default).
  • Cap topic_partitions_per_shard to avoid unbalanced shards.
  • For mTLS, use the Operator's CertManager integration rather than rolling your own.
  • Enable Continuous Data Balancing (Enterprise) for clusters with skewed partition leadership.

Performance Tuning

Tunables and their effects:

  • kafka_request_max_bytes: maximum request size (default 100 MiB). Raise for large batch producers.
  • group_min_session_timeout_ms / group_max_session_timeout_ms: bounds for consumer group session timeouts.
  • cloud_storage_segment_max_upload_interval_sec: how often closed segments upload to S3.
  • cloud_storage_max_connections: concurrency to S3; raise for higher upload throughput.
  • log_segment_size: default 1 GiB. Smaller segments tier more often (more S3 ops); larger segments retain more locally.
  • log_compaction_interval_ms: compaction frequency for compacted topics.
  • enable_idempotence / enable_transactions: required for Kafka exactly-once (EOS) workloads.
  • tls_min_version: minimum accepted TLS version; set to v1.3 to force TLS 1.3 only.
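To see how log_segment_size interacts with local retention, a quick sketch (per-partition write rate is illustrative; shell arithmetic with ceiling division):

```shell
# Local disk needed per partition, and how many segments that implies
# at the 1 GiB log_segment_size stated above.
write_mb_per_s=5              # illustrative per-partition write rate
local_retention_hours=24      # hot tier kept on NVMe
segment_size_mb=1024          # log_segment_size (1 GiB)

local_mb=$((write_mb_per_s * 3600 * local_retention_hours))
segments=$(( (local_mb + segment_size_mb - 1) / segment_size_mb ))  # ceil

echo "${local_mb} MB on local disk, ~${segments} segments per partition"
```

Halving the segment size doubles the segment count (and S3 PUT operations) for the same data, which is the trade-off the table describes.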

Troubleshooting

Raft leadership thrash

Symptom: rpk cluster health shows leaders changing rapidly; client logs show LEADER_NOT_AVAILABLE retries.

Causes: insufficient disk IOPS, oversubscribed CPU, network latency.

Fixes:
  • Inspect the redpanda_raft_leadership_changes_total Prometheus metric.
  • Check iostat -xz 1 for per-disk write latency above 5 ms.
  • Run rpk debug bundle and ship it to Redpanda support.

Slow archival uploads

Symptom: redpanda_cloud_storage_uploadable_segments keeps growing.

Causes: S3 throttling, undersized cloud_storage_max_connections, slow network.

Fix:
  • Raise cloud_storage_max_connections (default 20).
  • Check the S3 bucket for prefix hot-spotting; partition object keys by topic prefix.

Broker degraded state

Symptom: rpk cluster health shows a broker not "healthy".

Causes: disk full, OOM, network partition, certificate expiry.

Fix:
  • Run rpk cluster info for detailed broker state.
  • Verify TLS certificate rotation if mTLS is enabled.
  • Check journalctl -u redpanda and /var/log/redpanda/redpanda.log.

Controller backlog

Symptom: redpanda_controller_pending_tasks > 100; topic creates take seconds.

Cause: controller Raft leader is overloaded.

Fix:
  • Step down controller leadership: rpk cluster step-down --controller.
  • Confirm the new controller leader lands on a low-load broker.

Kafka client compatibility issue

Symptom: a specific Kafka client errors when it uses a newer protocol feature.

Fix: Redpanda implements most Kafka KIPs but may lag the very latest. Check the feature parity matrix and pin the client to compatible Kafka API versions.

Cost Analysis

Cost drivers:

  • Compute: priced per core; thread-per-core means scaling cores scales throughput roughly linearly.
  • Local disk: only the hot tier needs to live on NVMe; size to working set plus a safety margin.
  • Object storage: cold tier; cost is dominated by stored GB rather than request ops.
  • Network egress: cross-AZ replication traffic in multi-AZ deployments.
  • Enterprise license: required for read replicas, audit logs, and Continuous Data Balancing.
  • Redpanda Cloud: billed per cluster (Dedicated) or by usage (Serverless).
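The NVMe/S3 split dominates the storage line item. A sketch with made-up unit prices (substitute your provider's real rates; the 10% hot fraction is an assumption):

```shell
# Illustrative monthly storage cost: hot tier on NVMe, cold tier in S3.
retention_gb=10000
hot_fraction_pct=10           # working set kept locally (assumed)
nvme_cents_per_gb=10          # made-up price, not a quote
s3_cents_per_gb=2             # made-up price, not a quote

hot_gb=$((retention_gb * hot_fraction_pct / 100))
cold_gb=$((retention_gb - hot_gb))
cost_cents=$((hot_gb * nvme_cents_per_gb + cold_gb * s3_cents_per_gb))
all_nvme_cents=$((retention_gb * nvme_cents_per_gb))

echo "tiered: \$$((cost_cents / 100))/mo vs all-NVMe: \$$((all_nvme_cents / 100))/mo"
```

Whatever the real prices, the shape holds: cost falls roughly in proportion to the share of retention you push to the cold tier.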

Commands & Recipes

Cluster bootstrap

# Local dev: 3-node cluster in Docker
rpk container start -n 3
rpk cluster info
rpk cluster health

Topic management

rpk topic create orders --partitions 12 --replicas 3 \
  --topic-config retention.ms=604800000 \
  --topic-config cleanup.policy=delete \
  --topic-config redpanda.remote.read=true \
  --topic-config redpanda.remote.write=true

rpk topic list
rpk topic describe orders
rpk topic produce orders < data.txt
rpk topic consume orders -o start -n 5
rpk topic delete orders
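The retention.ms value used above is just 7 days expressed in milliseconds; computing it beats pasting magic numbers:

```shell
# retention.ms for the topic above: 7 days in milliseconds.
days=7
retention_ms=$((days * 24 * 60 * 60 * 1000))
echo "retention.ms=${retention_ms}"
```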

ACLs

rpk acl create --allow-principal 'User:app' \
  --operation read,describe \
  --resource-pattern-type prefixed --topic orders

rpk acl list

Cluster config

rpk cluster config get
rpk cluster config edit
rpk cluster config set cloud_storage_max_connections 50

User management (SASL/SCRAM)

rpk acl user create app -p 'change-me-now' --mechanism SCRAM-SHA-512
rpk acl user list
rpk acl user delete app

Schema Registry

rpk registry subject list
rpk registry schema create orders-value --schema ./schema.avsc --type avro
rpk registry compatibility-level set orders-value --level FORWARD_TRANSITIVE

Connect (Benthos-based)

rpk connect run ./pipeline.yaml

Debug bundle

rpk debug bundle --output /tmp/bundle.zip

Prometheus

Redpanda exposes :9644/public_metrics (Prometheus format) and :9644/metrics (internal, more verbose). Import the official Grafana dashboards from the redpanda-data/observability repo.
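A minimal Prometheus scrape job for the public endpoint might look like this (broker hostnames are placeholders):

```yaml
# Sketch: scrape Redpanda's public metrics on the admin port.
scrape_configs:
  - job_name: redpanda
    metrics_path: /public_metrics
    static_configs:
      - targets:
          - redpanda-0:9644
          - redpanda-1:9644
          - redpanda-2:9644
```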

Helm upgrade

helm upgrade redpanda redpanda/redpanda \
  --namespace redpanda \
  --reuse-values \
  --version 5.x.x

Rolling restart with leadership transfer

for n in 0 1 2; do
  rpk cluster step-down --node $n                  # move Raft leaders off the broker
  kubectl delete pod redpanda-$n -n redpanda       # StatefulSet recreates the pod
  rpk cluster health --watch --exit-when-healthy   # block until the cluster recovers
done

Cross-references