
Operations

Production deployment of Redpanda — sizing, tuning, troubleshooting, and a Commands & Recipes section using rpk and the Helm chart.

Deployment Patterns

Single-binary install

For development or small production:

curl -1sLf 'https://dl.redpanda.com/nzc4ZYQK3WRGd9sy/redpanda/cfg/setup/bash.deb.sh' | sudo bash
sudo apt install redpanda
sudo systemctl enable --now redpanda

rpk tunes the OS (CPU governor, swap, I/O scheduler) via rpk redpanda tune all; the package installs a tuner service that applies these optimizations at startup.

Three-node production cluster

Three brokers across three AZs with local NVMe storage and S3 for tiered storage:

# /etc/redpanda/redpanda.yaml
redpanda:
  data_directory: /var/lib/redpanda/data
  node_id: 0
  seed_servers:
    - host: { address: redpanda-0, port: 33145 }
    - host: { address: redpanda-1, port: 33145 }
    - host: { address: redpanda-2, port: 33145 }
  rpc_server:
    address: 0.0.0.0
    port: 33145
  kafka_api:
    - address: 0.0.0.0
      port: 9092
  admin:
    - address: 0.0.0.0
      port: 9644
  cloud_storage_enabled: true
  cloud_storage_bucket: redpanda-tiered-prod
  cloud_storage_region: us-east-1
  cloud_storage_credentials_source: aws_instance_metadata

Kubernetes (Operator + Helm)

helm repo add redpanda https://charts.redpanda.com
helm install redpanda redpanda/redpanda \
  --namespace redpanda --create-namespace \
  --set tls.enabled=true \
  --set listeners.kafka.tls.enabled=true \
  --set storage.persistentVolume.size=200Gi \
  --set storage.persistentVolume.storageClass=ssd \
  --set resources.cpu.cores=4 \
  --set statefulset.replicas=3

The Helm chart supports both the simple StatefulSet path and the Redpanda Operator for declarative cluster, topic, and user CRDs.
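With the Operator enabled, topics can be managed declaratively. A sketch of a Topic resource (apiVersion and field names assumed from the v1alpha2 CRD; verify with kubectl explain topic.spec against your installed chart version):

```yaml
# Sketch of an Operator-managed topic; field names are assumptions,
# check the CRD reference for your chart version.
apiVersion: cluster.redpanda.com/v1alpha2
kind: Topic
metadata:
  name: orders
  namespace: redpanda
spec:
  partitions: 12
  replicationFactor: 3
  additionalConfig:
    retention.ms: "604800000"
    cleanup.policy: "delete"
```

The advantage over rpk topic create is that the topic definition lives in version control and reconciles like any other Kubernetes resource.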

Sizing

Resource guidance:

  • CPU: cores map 1:1 to Seastar shards. Start with 4–8 cores per node.
  • Memory: 2 GB per core minimum; 4 GB+ per core for hot working sets.
  • Disk: local NVMe for data_directory; size to retain the hot tier (working set).
  • Object storage: S3 / GCS / Azure ADLS for tiered storage; size to full retention.
  • Network: 10 GbE+ between brokers; lower-latency NICs improve tail latency.
  • Hugepages: optional, but they help latency at high throughput.
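The guidance above reduces to back-of-envelope arithmetic. A sketch with illustrative inputs (the 1,000-partitions-per-shard cap is an assumed topic_partitions_per_shard value; substitute your own numbers):

```shell
# Back-of-envelope cluster capacity from the sizing guidance above.
brokers=3
cores_per_broker=8            # one Seastar shard per core
mem_gb_per_core=4             # "4 GB+ for hot working sets"
partitions_per_shard=1000     # topic_partitions_per_shard cap (assumed)

mem_per_broker_gb=$((cores_per_broker * mem_gb_per_core))
total_shards=$((brokers * cores_per_broker))
max_partitions=$((total_shards * partitions_per_shard))

echo "memory per broker: ${mem_per_broker_gb} GB"
echo "total shards: ${total_shards}"
echo "partition capacity (upper bound): ${max_partitions}"
```

The point of the exercise: partition capacity is a function of total shards, so adding cores (not just brokers) raises the ceiling.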

Best Practices

  • Run rpk redpanda mode production, then rpk redpanda tune all, to apply all OS optimizations.
  • One core = one shard = stable partition assignment; resizing CPU requires a careful re-sharding plan.
  • Disable swap entirely; Seastar pre-allocates memory.
  • Use rpk cluster config edit rather than hand-editing YAML — it validates syntactically.
  • Set the topic properties redpanda.remote.write=true and redpanda.remote.read=true for any topic where you want long retention without local disk cost (cloud_storage_enable_remote_write sets the cluster-wide default).
  • Cap topic_partitions_per_shard to avoid unbalanced shards.
  • For mTLS, use the Operator's CertManager integration rather than rolling your own.
  • Enable Continuous Data Balancing (Enterprise) for clusters with skewed partition leadership.

Performance Tuning

Tunables and their effects:

  • kafka_request_max_bytes: maximum request size (default 100 MiB). Raise for large batch producers.
  • group_min_session_timeout_ms / group_max_session_timeout_ms: bounds for consumer group session timeouts.
  • cloud_storage_segment_max_upload_interval_sec: how often closed segments upload to S3.
  • cloud_storage_max_connections: concurrency to S3; raise for higher upload throughput.
  • log_segment_size: default 1 GiB. Smaller segments tier more often (more S3 ops); larger segments retain more locally.
  • log_compaction_interval_ms: compaction frequency for compacted topics.
  • enable_idempotence / enable_transactions: required for Kafka exactly-once (EOS) workloads.
  • tls_min_version: minimum accepted TLS version; set to v1.3 to force TLS 1.3 only.
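To see how log_segment_size interacts with local retention, a quick sketch (per-partition write rate is illustrative; shell arithmetic with ceiling division):

```shell
# Local disk needed per partition, and how many segments that implies
# at the 1 GiB log_segment_size stated above.
write_mb_per_s=5              # illustrative per-partition write rate
local_retention_hours=24      # hot tier kept on NVMe
segment_size_mb=1024          # log_segment_size (1 GiB)

local_mb=$((write_mb_per_s * 3600 * local_retention_hours))
segments=$(( (local_mb + segment_size_mb - 1) / segment_size_mb ))  # ceil

echo "${local_mb} MB on local disk, ~${segments} segments per partition"
```

Halving the segment size doubles the segment count (and S3 PUT operations) for the same data, which is the trade-off the table describes.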

Troubleshooting

Raft leadership thrash

Symptom: rpk cluster health shows leaders changing rapidly; client logs show LEADER_NOT_AVAILABLE retries.

Causes: insufficient disk IOPS, oversubscribed CPU, network latency.

Fixes:
  • Inspect the redpanda_raft_leadership_changes_total Prometheus metric.
  • Check iostat -xz 1 for per-disk write latency above 5 ms.
  • Run rpk debug bundle and ship it to Redpanda support.

Slow archival uploads

Symptom: redpanda_cloud_storage_uploadable_segments keeps growing.

Causes: S3 throttling, undersized cloud_storage_max_connections, slow network.

Fix:
  • Raise cloud_storage_max_connections (default 20).
  • Check the S3 bucket for prefix hot-spotting; partition object keys by topic prefix.

Broker degraded state

Symptom: rpk cluster health shows a broker not "healthy".

Causes: disk full, OOM, network partition, certificate expiry.

Fix:
  • Run rpk cluster info for detailed broker state.
  • Verify TLS certificate rotation if mTLS is enabled.
  • Check journalctl -u redpanda and /var/log/redpanda/redpanda.log.

Controller backlog

Symptom: redpanda_controller_pending_tasks > 100; topic creates take seconds.

Cause: controller Raft leader is overloaded.

Fix:
  • Step down controller leadership: rpk cluster step-down --controller.
  • Confirm the new controller leader lands on a low-load broker.

Kafka client compatibility issue

Symptom: a specific Kafka client errors when it uses a newer protocol feature.

Fix: Redpanda implements most Kafka KIPs but may lag the very latest. Check the feature parity matrix and pin the client to compatible Kafka API versions.

Cost Analysis

Cost drivers:

  • Compute: priced per core; thread-per-core means scaling cores scales throughput roughly linearly.
  • Local disk: only the hot tier needs to live on NVMe; size to working set plus a safety margin.
  • Object storage: cold tier; cost is dominated by stored GB rather than request ops.
  • Network egress: cross-AZ replication traffic in multi-AZ deployments.
  • Enterprise license: required for read replicas, audit logs, and Continuous Data Balancing.
  • Redpanda Cloud: billed per cluster (Dedicated) or by usage (Serverless).
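The NVMe/S3 split dominates the storage line item. A sketch with made-up unit prices (substitute your provider's real rates; the 10% hot fraction is an assumption):

```shell
# Illustrative monthly storage cost: hot tier on NVMe, cold tier in S3.
retention_gb=10000
hot_fraction_pct=10           # working set kept locally (assumed)
nvme_cents_per_gb=10          # made-up price, not a quote
s3_cents_per_gb=2             # made-up price, not a quote

hot_gb=$((retention_gb * hot_fraction_pct / 100))
cold_gb=$((retention_gb - hot_gb))
cost_cents=$((hot_gb * nvme_cents_per_gb + cold_gb * s3_cents_per_gb))
all_nvme_cents=$((retention_gb * nvme_cents_per_gb))

echo "tiered: \$$((cost_cents / 100))/mo vs all-NVMe: \$$((all_nvme_cents / 100))/mo"
```

Whatever the real prices, the shape holds: cost falls roughly in proportion to the share of retention you push to the cold tier.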

Commands & Recipes

Cluster bootstrap

# Local dev: 3-node cluster in Docker
rpk container start -n 3
rpk cluster info
rpk cluster health

Topic management

rpk topic create orders --partitions 12 --replicas 3 \
  --topic-config retention.ms=604800000 \
  --topic-config cleanup.policy=delete \
  --topic-config redpanda.remote.read=true \
  --topic-config redpanda.remote.write=true

rpk topic list
rpk topic describe orders
rpk topic produce orders < data.txt
rpk topic consume orders -o start -n 5
rpk topic delete orders
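The retention.ms value used above is just 7 days expressed in milliseconds; computing it beats pasting magic numbers:

```shell
# retention.ms for the topic above: 7 days in milliseconds.
days=7
retention_ms=$((days * 24 * 60 * 60 * 1000))
echo "retention.ms=${retention_ms}"
```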

ACLs

rpk acl create --allow-principal 'User:app' \
  --operation read,describe \
  --resource-pattern-type prefixed --topic orders

rpk acl list

Cluster config

rpk cluster config get
rpk cluster config edit
rpk cluster config set cloud_storage_max_connections 50

User management (SASL/SCRAM)

rpk acl user create app -p 'change-me-now' --mechanism SCRAM-SHA-512
rpk acl user list
rpk acl user delete app

Schema Registry

rpk registry subject list
rpk registry schema create orders-value --schema ./schema.avsc --type avro
rpk registry compatibility-level set orders-value --level FORWARD_TRANSITIVE

Connect (Benthos-based)

rpk connect run ./pipeline.yaml

Debug bundle

rpk debug bundle --output /tmp/bundle.zip

Prometheus

Redpanda exposes :9644/public_metrics (Prometheus format) and :9644/metrics (internal, more verbose). Import the official Grafana dashboards from the redpanda-data/observability repo.
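A minimal Prometheus scrape job for the public endpoint might look like this (broker hostnames are placeholders):

```yaml
# Sketch: scrape Redpanda's public metrics on the admin port.
scrape_configs:
  - job_name: redpanda
    metrics_path: /public_metrics
    static_configs:
      - targets:
          - redpanda-0:9644
          - redpanda-1:9644
          - redpanda-2:9644
```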

Helm upgrade

helm upgrade redpanda redpanda/redpanda \
  --namespace redpanda \
  --reuse-values \
  --version 5.x.x

Rolling restart with leadership transfer

for n in 0 1 2; do
  rpk cluster step-down --node $n                  # move Raft leaders off the broker
  kubectl delete pod redpanda-$n -n redpanda       # StatefulSet recreates the pod
  rpk cluster health --watch --exit-when-healthy   # block until the cluster recovers
done

Cross-references