Operations¶
Production deployment of Redpanda — sizing, tuning, troubleshooting, and a Commands & Recipes section using rpk and the Helm chart.
Deployment Patterns¶
Single-binary install¶
For development or small production:
curl -1sLf 'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' | sudo bash
sudo apt install redpanda
sudo systemctl enable --now redpanda
rpk can autotune the OS (CPU governor, swap, I/O scheduler): after setting production mode, run sudo rpk redpanda tune all.
Three-node cluster (recommended baseline)¶
Three brokers across three AZs with NVMe local storage and S3 for the tiered-storage tier:
# /etc/redpanda/redpanda.yaml
redpanda:
  data_directory: /var/lib/redpanda/data
  node_id: 0
  seed_servers:
    - host: { address: redpanda-0, port: 33145 }
    - host: { address: redpanda-1, port: 33145 }
    - host: { address: redpanda-2, port: 33145 }
  rpc_server:
    address: 0.0.0.0
    port: 33145
  kafka_api:
    - address: 0.0.0.0
      port: 9092
  admin:
    - address: 0.0.0.0
      port: 9644
  cloud_storage_enabled: true
  cloud_storage_bucket: redpanda-tiered-prod
  cloud_storage_region: us-east-1
  cloud_storage_credentials_source: aws_instance_metadata
Kubernetes (Operator + Helm)¶
helm repo add redpanda https://charts.redpanda.com
helm install redpanda redpanda/redpanda \
  --namespace redpanda --create-namespace \
  --set tls.enabled=true \
  --set listeners.kafka.tls.enabled=true \
  --set storage.persistentVolume.size=200Gi \
  --set storage.persistentVolume.storageClass=ssd \
  --set resources.cpu.cores=4 \
  --set statefulset.replicas=3
The Helm chart supports both the simple StatefulSet path and the Redpanda Operator for declarative cluster, topic, and user CRDs.
Sizing¶
| Resource | Guidance |
|---|---|
| CPU | Cores = number of Seastar shards. Start with 4–8 cores per node. |
| Memory | 2 GB per core minimum; 4 GB+ for hot working sets. |
| Disk | NVMe local for data_directory; size to retain hot tier (working set). |
| Object storage | S3 / GCS / Azure ADLS for tiered storage; size to full retention. |
| Network | 10 GbE+ between brokers; lower-latency NICs improve tail latency. |
| Hugepages | Optional but help latency at high throughput. |
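As a rough illustration of the table above, the following sketch turns a core count and an estimated working set into per-node figures. The name size_node and the 30% disk safety margin are assumptions made for this example, not Redpanda defaults; the 2 GB and 4 GB per-core figures are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class NodeSizing:
    shards: int            # one Seastar shard per core
    min_memory_gb: int     # 2 GB per core floor
    hot_memory_gb: int     # 4 GB per core for hot working sets
    local_disk_gb: float   # working set plus safety margin


def size_node(cores: int, working_set_gb: float,
              safety_margin: float = 0.3) -> NodeSizing:
    """Apply the sizing table's rules of thumb to one broker.

    safety_margin is a hypothetical default chosen for illustration.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if working_set_gb < 0 or safety_margin < 0:
        raise ValueError("working_set_gb and safety_margin must be non-negative")
    return NodeSizing(
        shards=cores,
        min_memory_gb=2 * cores,
        hot_memory_gb=4 * cores,
        local_disk_gb=working_set_gb * (1 + safety_margin),
    )


if __name__ == "__main__":
    print(size_node(cores=8, working_set_gb=500))
```

Treat the output as a starting point for capacity planning; measured throughput and retention should drive the final numbers.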
Best Practices¶
- Run rpk redpanda mode production so that all OS tuners are enabled.
- One core = one shard = stable partition assignment; resizing CPU requires a careful re-sharding plan.
- Disable swap entirely; Seastar pre-allocates memory.
- Use rpk cluster config edit rather than hand-editing YAML; it validates changes before applying them.
- Set cloud_storage_enable_remote_write: true as the cluster default, or redpanda.remote.write=true on individual topics, wherever you want long retention without local disk cost.
- Cap topic_partitions_per_shard to avoid unbalanced shards.
- For mTLS, use the Operator's cert-manager integration rather than rolling your own.
- Enable Continuous Data Balancing (Enterprise) for clusters with skewed partition leadership.
Performance Tuning¶
| Tunable | Effect |
|---|---|
| kafka_request_max_bytes | Maximum request size (default 100 MB). Raise for large batch producers. |
| group_min_session_timeout_ms / group_max_session_timeout_ms | Lower and upper bounds on the session timeout a consumer group member may request. |
| cloud_storage_segment_max_upload_interval_sec | Maximum time a segment can stay local before it is uploaded to S3. |
| cloud_storage_max_connections | Concurrency to S3; raise for higher upload throughput. |
| log_segment_size | Default 128 MiB. Smaller segments tier more often (more S3 ops); larger segments retain more locally. |
| log_compaction_interval_ms | Compaction frequency for compacted topics. |
| enable_idempotence / enable_transactions | Required for Kafka EOS workloads. |
| tls_min_version | Minimum TLS version accepted; set to v1.3 to force TLS 1.3 only. |
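To make the segment-size trade-off concrete, here is a minimal sketch that estimates how many segment uploads one partition generates per day. It is a simplified model that assumes steady ingest and ignores compaction and retries; the function name and the example rates are illustrative assumptions.

```python
from typing import Optional

SECONDS_PER_DAY = 86_400
MIB = 1024 ** 2
GIB = 1024 ** 3


def segment_uploads_per_day(ingest_bytes_per_sec: float,
                            segment_size_bytes: int,
                            max_upload_interval_sec: Optional[float] = None) -> float:
    """Estimate uploads per partition per day.

    Segments roll when they reach segment_size_bytes. If an upload interval
    is given, assume at least one upload per interval (simplified model).
    """
    if ingest_bytes_per_sec <= 0 or segment_size_bytes <= 0:
        raise ValueError("ingest rate and segment size must be positive")
    size_driven = ingest_bytes_per_sec * SECONDS_PER_DAY / segment_size_bytes
    if max_upload_interval_sec is None:
        return size_driven
    if max_upload_interval_sec <= 0:
        raise ValueError("max_upload_interval_sec must be positive")
    interval_driven = SECONDS_PER_DAY / max_upload_interval_sec
    return max(size_driven, interval_driven)


if __name__ == "__main__":
    # Compare two segment sizes at 1 MiB/s of ingest into a single partition.
    for size in (128 * MIB, GIB):
        print(size // MIB, "MiB ->", segment_uploads_per_day(MIB, size))
```

Multiplying the result by the partition count gives a rough PUT-request volume to compare against object-storage request pricing.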
Troubleshooting¶
Raft leadership thrash¶
Symptom: rpk cluster health shows leaders changing rapidly; client logs show LEADER_NOT_AVAILABLE retries.
Causes: insufficient disk IOPS, oversubscribed CPU, network latency.
Fixes:
- Inspect redpanda_raft_leadership_changes_total Prometheus metric.
- Check iostat -xz 1 for per-disk write latency >5ms.
- Run rpk debug bundle and ship to Redpanda support.
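When inspecting the leadership-change counter by hand, two scrapes taken a known interval apart are enough to estimate a rate. The sketch below is one way to do that, assuming the metric behaves as a monotonically increasing counter that resets to zero on broker restart; the function name is hypothetical and no alert threshold is implied.

```python
def rate_per_minute(previous: float, current: float, elapsed_sec: float) -> float:
    """Per-minute increase of a counter between two scrapes.

    If the counter went backwards, assume it was reset (for example by a
    broker restart) and count only what accumulated since the reset.
    """
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    delta = current - previous
    if delta < 0:
        delta = current
    return delta * 60.0 / elapsed_sec


if __name__ == "__main__":
    # 30 leadership changes in 30 seconds is 60 per minute.
    print(rate_per_minute(previous=100, current=130, elapsed_sec=30))
```

In practice the same calculation is what PromQL's rate() performs; the sketch is only meant for quick checks from raw scrape output.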
Slow archival uploads¶
Symptom: redpanda_cloud_storage_uploadable_segments keeps growing.
Causes: S3 throttling, undersized cloud_storage_max_connections, slow network.
Fix:
- Raise cloud_storage_max_connections (default 20).
- Check S3 bucket for prefix hot-spotting; partition by topic-prefix.
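"Keeps growing" can be checked numerically by fitting a line through a few samples of the backlog gauge. The following is a minimal least-squares sketch; the sample values are made up for illustration, and what slope counts as a problem is left to the operator.

```python
from typing import Sequence, Tuple


def backlog_slope(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (unix_seconds, gauge_value) samples.

    A persistently positive slope means segments are queuing faster than
    they are being uploaded.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


if __name__ == "__main__":
    # Hypothetical scrapes taken one minute apart.
    scrapes = [(0, 10), (60, 12), (120, 15), (180, 19)]
    print(backlog_slope(scrapes) * 60, "segments per minute")
```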
Broker degraded state¶
Symptom: rpk cluster health shows a broker not "healthy".
Causes: disk full, OOM, network partition, certificate expiry.
Fix:
- rpk cluster info for detailed broker state.
- Verify TLS cert rotation if mTLS is enabled.
- Check journalctl -u redpanda and /var/log/redpanda/redpanda.log.
Controller backlog¶
Symptom: redpanda_controller_pending_tasks > 100; topic creates take seconds.
Cause: controller Raft leader is overloaded.
Fix:
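- Identify the controller leader (rpk cluster health reports the controller ID) and check the load on that broker.
- If it is overloaded, transfer controller leadership to a less-loaded broker through the Admin API: POST /v1/partitions/redpanda/controller/0/transfer_leadership?target=<broker-id> on port 9644.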
Kafka client compatibility issue¶
Symptom: specific Kafka client throws on a feature.
Fix: Redpanda implements most Kafka KIPs but may lag on the very latest. Check the feature parity matrix and pin the client to compatible Kafka API versions.
Cost Analysis¶
| Cost | Driver |
|---|---|
| Compute | Per-core; thread-per-core means scaling cores ≈ scaling throughput linearly. |
| Local disk | Only the hot tier needs to live on NVMe; size to working set + safety margin. |
| Object storage | Cold tier; cost dominated by storage GB rather than ops. |
| Network egress | Cross-AZ replication traffic in Multi-AZ deployments, plus any cross-region replication. |
| Enterprise license | Required for read replicas, audit logs, continuous balancing. |
| Redpanda Cloud | Per-cluster (Dedicated) or per-data (Serverless). |
Commands & Recipes¶
Cluster bootstrap¶
# Local dev cluster
rpk container start -n 3   # spin up a 3-node Docker cluster
rpk cluster info
rpk cluster health
Topic management¶
rpk topic create orders --partitions 12 --replicas 3 \
  --topic-config retention.ms=604800000 \
  --topic-config cleanup.policy=delete \
  --topic-config redpanda.remote.read=true \
  --topic-config redpanda.remote.write=true
rpk topic list
rpk topic describe orders
rpk topic produce orders < data.txt
rpk topic consume orders -o start -n 5
rpk topic delete orders
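The retention.ms value used in the topic creation command is seven days expressed in milliseconds. A small helper like the one below (the name retention_ms is an assumption for this example) avoids hand-computing such values.

```python
from datetime import timedelta


def retention_ms(period: timedelta) -> int:
    """Convert a retention period to whole milliseconds for retention.ms."""
    if period < timedelta(0):
        raise ValueError("retention period must not be negative")
    # Dividing one timedelta by another with // yields an int.
    return period // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(retention_ms(timedelta(days=7)))  # 604800000
```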
ACLs¶
# Kafka ACLs match literal or prefixed resource names, not regexes
rpk acl create --allow-principal 'User:app' \
  --operation read,describe \
  --topic orders --resource-pattern-type prefixed
rpk acl list
Cluster config¶
rpk cluster config get cloud_storage_max_connections
rpk cluster config edit
rpk cluster config set cloud_storage_max_connections 50
User management (SASL/SCRAM)¶
rpk acl user create app -p 'change-me-now' --mechanism SCRAM-SHA-512
rpk acl user list
rpk acl user delete app
Schema Registry¶
rpk registry subject list
rpk registry schema create orders-value --schema ./schema.avsc --type avro
rpk registry compatibility-level set orders-value --level FORWARD_TRANSITIVE
Connect (Benthos-based)¶
Debug bundle¶
# Collect diagnostics into an archive to ship to Redpanda support
rpk debug bundle
Prometheus¶
Redpanda exposes :9644/public_metrics (Prometheus format) and :9644/metrics (internal, more verbose). Apply official Grafana dashboards from redpanda-data/observability.
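For ad-hoc checks without a Prometheus server, the text format served by the metrics endpoint can be parsed directly. The sketch below sums one metric across all of its label sets; the localhost URL, the label names in the sample, and the helper names are assumptions for illustration, and the parser handles only plain sample lines of the form name{labels} value.

```python
from urllib.request import urlopen


def sum_metric(text: str, name: str) -> float:
    """Sum every sample of `name` in Prometheus text-format output."""
    total = 0.0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labels end at the last closing brace; the value follows it.
            head, _, rest = line.rpartition("}")
            metric = head.split("{", 1)[0]
        else:
            metric, _, rest = line.partition(" ")
        if metric != name:
            continue
        total += float(rest.split()[0])
    return total


def fetch_public_metrics(url: str = "http://localhost:9644/public_metrics") -> str:
    """Fetch the metrics page; the default URL assumes a local broker."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hand-written sample so the example runs without a broker.
    sample = """\
# TYPE redpanda_raft_leadership_changes_total counter
redpanda_raft_leadership_changes_total{redpanda_topic="orders"} 3
redpanda_raft_leadership_changes_total{redpanda_topic="payments"} 4
redpanda_unrelated_metric 9
"""
    print(sum_metric(sample, "redpanda_raft_leadership_changes_total"))  # 7.0
```

Against a live broker, pass the result of fetch_public_metrics() to sum_metric instead of the sample string.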
Helm upgrade¶
Rolling restart with leadership transfer¶
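# Assumes broker IDs 0-2 correspond to pods redpanda-0..2 in namespace redpanda.
for n in 0 1 2; do
  rpk cluster maintenance enable "$n" --wait      # drain leadership off this broker
  kubectl delete pod "redpanda-$n" -n redpanda     # restart only this broker
  kubectl rollout status statefulset/redpanda -n redpanda
  rpk cluster maintenance disable "$n"
  rpk cluster health --watch --exit-when-healthy   # block until the cluster is healthy again
done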
Cross-references¶
- messaging/redpanda/architecture — for the Seastar / Raft model you operate on.
- messaging/redpanda/security — for SASL / mTLS / RBAC.
- messaging/kafka/operations — for migration considerations.
- messaging/index — for cross-broker comparisons.