
Operations

Scope

Production deployment patterns for Apache Kafka 4.x in KRaft mode, performance tuning (producer/broker/consumer), troubleshooting (under-replicated partitions, ISR shrinkage, broker bounces), tiered storage operations, and cost analysis.

Production Deployment Patterns

Cluster Topology (KRaft)

| Pattern | Controllers | Brokers | When to use |
| --- | --- | --- | --- |
| Combined (process.roles=controller,broker) | 1–3 | Same nodes | Dev / staging / very small prod (<5 brokers) |
| Isolated (process.roles=controller vs broker) | 3 dedicated | N dedicated | Production. Recommended for any cluster you'd page someone for. |
| Cross-AZ | 3 controllers spread across 3 AZs | Brokers spread across 3 AZs, RF=3 with min.insync.replicas=2 | Single-region HA |
| Cross-region (active-passive) | One cluster per region | MirrorMaker 2 replicating | DR with RPO measured in seconds |
| Cross-region (active-active) | One cluster per region | Bidirectional MM2 with prefix-renamed topics | Lowest-RPO HA at the cost of dual writes / topic prefixing |

A 3-node controller quorum tolerates 1 controller failure; 5 nodes tolerate 2. Most production clusters run 3 controllers.
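
For the cross-region rows above, a minimal MirrorMaker 2 properties sketch (active-passive; the cluster aliases and bootstrap hosts are placeholders):

# mm2.properties: run with bin/connect-mirror-maker.sh mm2.properties
clusters = primary, backup
primary.bootstrap.servers = primary-kafka:9092
backup.bootstrap.servers = backup-kafka:9092
# replicate everything from primary into backup; topics arrive
# prefixed by the default replication policy (e.g. primary.orders)
primary->backup.enabled = true
primary->backup.topics = .*
replication.factor = 3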

Broker Sizing

  • CPU: 8–16 vCPU per broker is typical. ZSTD compression and TLS handshakes are the heaviest CPU consumers.
  • RAM: page cache is everything. Plan for >=32 GiB (often 64–128 GiB) so that the active workload fits in cache. JVM heap should be capped (4–6 GiB is usually enough; do not let it crowd out the page cache).
  • Disk: SSD/NVMe; XFS preferred over ext4. Separate log directory from OS disk. Multiple log directories on multiple disks distribute partition load.
  • Network: 10 GbE minimum, 25 GbE+ for large clusters. Replication doubles or triples your inbound bytes.
  • JVM: G1GC remains the default; ZGC has been adopted at large shops for sub-ms pauses (see the JVM sketch below).
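
A sketch of the matching JVM settings via the environment variables kafka-server-start.sh honors (heap size and collector choice are illustrative; note that setting KAFKA_JVM_PERFORMANCE_OPTS replaces the shipped G1 defaults):

# 6 GiB fixed heap on a 64 GiB host leaves ~55 GiB for the page cache
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
# opt into ZGC in place of the default G1 flags
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseZGC"
bin/kafka-server-start.sh -daemon config/kraft/server.properties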

Topic Design

| Decision | Guidance |
| --- | --- |
| Partitions per topic | Pick by max consumer parallelism × headroom (e.g. 12, 24, 48). Cannot be decreased later without recreating the topic. |
| Replication factor | 3 for production. RF=2 leaves no headroom for a node failure during a rolling upgrade. |
| min.insync.replicas | Set to RF − 1 (typically 2 for RF=3). Combined with acks=all this gives durable writes. |
| Cleanup policy | delete (time/size retention) for events; compact for state; compact,delete for hybrid (e.g. CDC). |
| Retention | Set per topic via retention.ms and/or retention.bytes. |
| Compression | Producer-side compression.type=zstd usually gives the best ratio; lz4 and snappy cost less CPU. |
| Tiered storage | For >7 days of retention at >100 MB/s, enable remote.storage.enable=true per topic. |

Partition Planning

  • Total partitions per cluster scales with brokers and hardware. A modest 6-broker KRaft cluster comfortably handles tens of thousands of partitions; large Confluent Cloud clusters serve hundreds of thousands.
  • Per-broker partition count drives metadata cache size, follower fetch threads, and recovery time after restart. Watch the kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount metric.
  • Increasing partitions later is cheap for new keys but breaks key-based ordering for existing keys.
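
To ground the per-partition throughput assumption empirically, the bundled perf tool is the usual probe (topic name and record counts here are arbitrary):

# write 5M 1 KiB records as fast as the cluster allows; read MB/s off the summary line
bin/kafka-producer-perf-test.sh --topic sizing-probe \
  --num-records 5000000 --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 acks=all compression.type=zstd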

Performance Tuning

Producer

| Setting | Default | Tuning advice |
| --- | --- | --- |
| acks | all (4.x default) | Keep all for durability. Set to 1 only for low-stakes telemetry. |
| enable.idempotence | true (4.x default) | Always on; avoids duplicates without sacrificing throughput. |
| compression.type | none | Use zstd for the highest ratio, lz4/snappy for lower CPU. |
| batch.size | 16384 (16 KiB) | Increase to 64 KiB–1 MiB for throughput-oriented workloads. |
| linger.ms | 5 (4.x default) | Was 0 pre-4.0. Higher = larger batches = better throughput, worse latency. |
| max.in.flight.requests.per.connection | 5 | Keep at 5 (the max for idempotence in 4.x; the auto-fallback was removed). |
| buffer.memory | 32 MiB | Increase if you see record-send-rate drops or buffer-pool waits. |
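
Putting the table together, a throughput-leaning producer config sketch (values are starting points, not prescriptions):

# producer.properties
acks=all
enable.idempotence=true
compression.type=zstd
# 128 KiB batches
batch.size=131072
# trade ~20 ms of latency for fuller batches
linger.ms=20
# 64 MiB accumulator
buffer.memory=67108864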

Broker

| Setting | Default | Tuning advice |
| --- | --- | --- |
| num.io.threads | 8 | Match to disk parallelism; raise for many-disk hosts. |
| num.network.threads | 3 | Raise for high connection counts. |
| num.replica.fetchers | 1 | Raise to 4–8 for clusters with many followers per broker. |
| socket.send.buffer.bytes / socket.receive.buffer.bytes | 100 KiB / 100 KiB | Match to the bandwidth-delay product on fat networks. |
| log.flush.interval.messages / log.flush.interval.ms | unlimited | Do not enable unless you have a specific durability mandate; rely on RF + acks=all instead. |
| log.segment.bytes | 1 GiB | Smaller segments = finer retention granularity but more open files. |
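
A server.properties fragment for a many-disk, high-fanout broker, reflecting the table above (values illustrative):

num.io.threads=16
num.network.threads=8
num.replica.fetchers=4
# ~1 MiB socket buffers for high bandwidth-delay-product links
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576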

Consumer

| Setting | Default | Tuning advice |
| --- | --- | --- |
| fetch.min.bytes | 1 | Raise to 1–10 KiB for batching efficiency. |
| fetch.max.wait.ms | 500 | Pair with fetch.min.bytes to amortize fetches. |
| max.poll.records | 500 | Lower for slow per-record processing to avoid max.poll.interval.ms timeouts. |
| isolation.level | read_uncommitted | Set to read_committed for transactional pipelines. |
| group.protocol | classic | Set to consumer to opt into the KIP-848 next-generation rebalance protocol. |
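
And the consumer side, sketched for a transactional pipeline with slow per-record processing (values illustrative):

# consumer.properties
# wait for ~10 KiB per fetch to amortize round trips
fetch.min.bytes=10240
fetch.max.wait.ms=500
# keep each poll loop comfortably inside max.poll.interval.ms
max.poll.records=200
isolation.level=read_committed
# KIP-848 next-generation rebalances
group.protocol=consumer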

Page Cache, Zero-Copy, Sendfile

  • Kafka deliberately delegates caching to the Linux page cache rather than maintaining a JVM-side buffer. Do not pin the JVM heap so large that it crowds out the cache; 4–6 GiB heap on a 64 GiB host is typical.
  • Tail reads (consumers keeping up with the end of the log) are served almost entirely from page cache.
  • For lagging consumers, the broker uses FileChannel.transferTo (Linux sendfile(2)) to stream the on-disk batch straight to the consumer socket without copying through user space. TLS on the wire defeats sendfile in the common case, since the JVM must encrypt in user space; kernel TLS (kTLS) offload can restore zero-copy where the OS and JVM support it.
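
The conventional Linux settings that accompany this model, as a sketch (the dirty ratios are commonly cited starting points; tune per host):

# /etc/sysctl.d/99-kafka.conf
# avoid swapping the heap out in favor of cache pages
vm.swappiness=1
# start background writeback early
vm.dirty_background_ratio=5
# allow a large dirty window before writers block
vm.dirty_ratio=60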

Troubleshooting

Under-Replicated Partitions

Symptom: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions > 0 for sustained periods.

Causes & responses:

  1. Slow follower fetch: check kafka.server:type=FetcherStats,name=BytesPerSec. Raise num.replica.fetchers or investigate disk/network on the slow follower.
  2. Broker down: check kafka.controller:type=KafkaController,name=ActiveBrokerCount. Restart the broker and let replication catch up.
  3. GC pauses: long GC on the leader stalls produce requests; on a follower it stalls fetches. Inspect gc.log, then switch from G1 to ZGC or increase the heap.
  4. Disk I/O saturation: run iostat -xm 1 and look at %util and await. Move log directories to faster disks or split them across more disks. (Triage commands below.)
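
The triage commands referenced above:

# which partitions are under-replicated right now?
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# is a disk on the slow follower saturated? (%util near 100, await climbing)
iostat -xm 1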

ISR Shrinkage

Symptom: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec is non-zero.

Run kafka-topics.sh --describe --under-min-isr-partitions to identify affected partitions. With KIP-966 (ELR) on Kafka 4.0+, ISR shrinkage no longer immediately stalls writes for acks=all because the controller can elect from the ELR.

Broker Bounce Procedure

To restart a broker safely, minimizing client disruption and time spent under-replicated:

# 1. Drain producer/consumer traffic if behind a Kafka load balancer
# 2. Trigger a controlled shutdown so partition leaders move off this broker first
bin/kafka-server-stop.sh

# 3. Wait for leadership transfer (the controller logs "Moved leader of ..." messages)
# 4. Perform the maintenance (kernel patch, JVM upgrade, config change)
# 5. Restart
bin/kafka-server-start.sh -daemon config/server.properties

# 6. Trigger preferred-replica election once the broker has caught up
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type PREFERRED --all-topic-partitions

controlled.shutdown.enable=true (default) is what makes step 2 graceful.
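
In a rolling restart, repeat the procedure broker by broker and wait in between until replication has caught up; an empty result here means no under-replicated partitions remain:

bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions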

Consumer Lag

bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group orders-aggregator
# Inspect LAG column. Sustained growth = consumer is too slow or partitioned poorly.

Mitigations: scale consumer count up to partition count; raise max.poll.records if processing is fast; lower it if processing is slow and you're hitting max.poll.interval.ms; investigate downstream backpressure (e.g. database write latency).

KRaft Controller Issues

# Inspect the metadata quorum
bin/kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --status

# Inspect per-replica lag and snapshot state
bin/kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --replication

If the active controller flaps, check controller-host CPU saturation, network partition, or disk latency — controllers fsync the metadata log on every commit.
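
For deeper digging, the metadata log itself can be decoded on a controller host (the log directory below matches the bootstrap config later on this page; __cluster_metadata-0 is the standard partition directory):

bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/data/__cluster_metadata-0/00000000000000000000.log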

Cost Analysis

| Cost driver | Typical contributors | Reduction levers |
| --- | --- | --- |
| Storage | Replication factor, retention, message size | Tiered storage (KIP-405) to S3, compaction, shorter retention.ms, zstd compression |
| Network egress | Cross-AZ replication, cross-region MM2, consumer fetches | Follower fetching (KIP-392: broker.rack on brokers, client.rack on consumers), Cluster Linking with offset preservation |
| Compute | Broker count, controller count, JVM CPU, TLS, compression | Right-size brokers, isolate controllers on smaller hosts, ZGC, hardware AES-NI |
| Managed surcharges | MSK / Confluent Cloud markups | Self-host on K8s with Strimzi; or use serverless tiers (MSK Serverless, Confluent Basic) for spiky workloads |

A representative AWS rule of thumb: cross-AZ traffic is by far the dominant variable cost for self-managed Kafka on EC2 — preferring intra-AZ produce/consume and using rack-aware fetching cuts the bill substantially.
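
Realizing the KIP-392 saving takes one setting on each side; a sketch with illustrative AZ IDs:

# broker server.properties (one rack value per AZ)
broker.rack=use1-az1
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# consumer config: fetch from the in-AZ replica instead of a remote leader
client.rack=use1-az1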


Commands & Recipes

KRaft Cluster Bootstrap

# Generate a cluster ID
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"

# Format the storage directory for the very first start.
# With a KIP-853 dynamic quorum (controller.quorum.bootstrap.servers), formatting
# controllers also needs --standalone (single controller) or --initial-controllers
# (full initial voter list).
bin/kafka-storage.sh format \
  --cluster-id "$KAFKA_CLUSTER_ID" \
  --config config/kraft/server.properties

# Start the broker
bin/kafka-server-start.sh -daemon config/kraft/server.properties

# Verify cluster status
bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092

A minimal config/kraft/server.properties for node 1 of a three-node combined controller+broker dev cluster:

process.roles=broker,controller
node.id=1
controller.quorum.bootstrap.servers=kafka1:9093,kafka2:9093,kafka3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
inter.broker.listener.name=PLAINTEXT
log.dirs=/var/lib/kafka/data
num.partitions=3
default.replication.factor=3
min.insync.replicas=2

Topic Lifecycle (kafka-topics.sh)

# Create
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic orders \
  --partitions 12 --replication-factor 3 \
  --config retention.ms=604800000 \
  --config min.insync.replicas=2 \
  --config compression.type=zstd \
  --config cleanup.policy=delete

# List
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list

# Describe (shows leader, ISR, ELR per partition)
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders

# Alter (increase partitions — careful, breaks key ordering)
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic orders --partitions 24

# Identify problem partitions
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions

# Delete
bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic orders

Configuration (kafka-configs.sh)

# Inspect topic-level config
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders --describe

# Tighten retention for an existing topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config retention.ms=86400000

# Enable tiered storage on an existing topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config remote.storage.enable=true,local.retention.ms=3600000

# Set client quotas (per user, per client-id)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type users --entity-name billing-svc \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=20971520'

# Dynamic broker config (no restart)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 1 \
  --alter --add-config log.cleaner.threads=4
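
Overrides are removed with --delete-config, which reverts the topic to the broker default:

bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --delete-config retention.ms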

Consumer Groups (kafka-consumer-groups.sh)

# List
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Describe (LAG per partition)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group orders-aggregator

# Members + assignments
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group orders-aggregator --members --verbose

# Offset reset (DRY RUN first!)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group orders-aggregator --topic orders \
  --reset-offsets --to-earliest --dry-run

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group orders-aggregator --topic orders \
  --reset-offsets --to-earliest --execute

# Reset to a wall-clock time
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group orders-aggregator --topic orders \
  --reset-offsets --to-datetime 2026-04-28T00:00:00.000 --execute

# Delete a group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --delete --group orders-aggregator

ACLs (kafka-acls.sh)

# Allow Bob to consume topic Test-topic in group Group-1
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:Bob --consumer \
  --topic Test-topic --group Group-1

# Producer ACL with prefix pattern
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
  --add --allow-principal User:Jane --producer \
  --topic 'orders.' --resource-pattern-type prefixed

# List existing ACLs
bin/kafka-acls.sh --bootstrap-server localhost:9092 --list
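
Revoking mirrors granting: --remove takes the same matching options (and prompts for confirmation unless --force is given):

bin/kafka-acls.sh --bootstrap-server localhost:9092 \
  --remove --allow-principal User:Bob --consumer \
  --topic Test-topic --group Group-1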

kcat (formerly kafkacat)

# Produce a single message (key=k1, value=v1)
echo "v1" | kcat -b localhost:9092 -t orders -k k1 -P

# Consume from beginning, print key + offset, exit at end
kcat -b localhost:9092 -t orders -C -o beginning -e \
  -f '%k\t%o\t%s\n'

# Consume as part of a balanced consumer group, printing key, offset, value
kcat -b localhost:9092 -G orders-aggregator orders \
  -f '%k\t%o\t%s\n'

# Produce JSON to a topic from stdin (one record per line)
cat events.ndjson | kcat -b localhost:9092 -t events -P

Tiered Storage Setup

# server.properties (broker)
remote.log.storage.system.enable=true
remote.log.metadata.manager.listener.name=PLAINTEXT
# LocalTieredStorage is Kafka's reference/test RSM plugin; production clusters
# substitute an S3/GCS-backed implementation shipped as a separate jar
remote.log.storage.manager.class.name=org.apache.kafka.server.log.remote.storage.LocalTieredStorage
remote.log.storage.manager.class.path=/opt/kafka/libs/tiered-storage-impl.jar
rsm.config.dir=/var/lib/kafka/remote-tier
rlmm.config.remote.log.metadata.topic.replication.factor=3
rlmm.config.remote.log.metadata.topic.min.isr=2

# Per-topic enablement
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic events-archive \
  --partitions 12 --replication-factor 3 \
  --config remote.storage.enable=true \
  --config local.retention.ms=3600000 \
  --config retention.ms=2592000000 \
  --config segment.bytes=536870912
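
To confirm the per-topic settings took effect:

bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name events-archive --describe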

Strimzi Operator (Kubernetes)

# kafka-cluster.yaml (Strimzi v0.51+)
apiVersion: kafka.strimzi.io/v1
kind: Kafka
metadata:
  name: orders-cluster
  namespace: kafka
spec:
  kafka:
    version: 4.2.0
    metadataVersion: 4.2-IV0
    replicas: 6
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          deleteClaim: false
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
  entityOperator:
    topicOperator: {}
    userOperator: {}
# Install the Strimzi operator
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --namespace kafka --create-namespace

# Apply the Kafka CR
kubectl apply -f kafka-cluster.yaml -n kafka

# Watch reconciliation
kubectl get kafka -n kafka -w
kubectl get kafkanodepool -n kafka
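
Node-pool-based Strimzi deployments (what the kafkanodepool check above inspects) move replicas and storage out of the Kafka CR into KafkaNodePool resources; a broker-pool sketch, assuming the same API version as the CR above (older operator versions also require the strimzi.io/node-pools: enabled annotation on the Kafka CR):

# kafka-nodepool.yaml
apiVersion: kafka.strimzi.io/v1
kind: KafkaNodePool
metadata:
  name: brokers
  namespace: kafka
  labels:
    # binds this pool to the Kafka CR above
    strimzi.io/cluster: orders-cluster
spec:
  replicas: 6
  roles:
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 500Gi
        deleteClaim: false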

Prometheus JMX Exporter

# kafka-metrics-config.yml — minimal high-value rules
# lowercaseOutputName makes emitted metric names match the PromQL further below
lowercaseOutputName: true
rules:
  - pattern: "kafka.server<type=(.+), name=(.+)PerSec\\w*, topic=(.+)><>Count"
    name: kafka_server_$1_$2_total
    labels:
      topic: "$3"
    type: COUNTER
  - pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
    name: kafka_server_replicamanager_underreplicated_partitions
    type: GAUGE
  - pattern: "kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>Count"
    name: kafka_server_replicamanager_isr_shrinks_total
    type: COUNTER
  - pattern: "kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value"
    name: kafka_controller_active_controllers
    type: GAUGE
  - pattern: "kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>Count"
    name: kafka_network_requests_total
    labels:
      request: "$1"
    type: COUNTER

Run the exporter as a Java agent on each broker:

export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-metrics-config.yml"
bin/kafka-server-start.sh config/kraft/server.properties

Strimzi handles this automatically when metricsConfig.type: jmxPrometheusExporter is set.
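
A matching Prometheus scrape config fragment (hostnames are placeholders):

# prometheus.yml
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets: ['kafka-1:7071', 'kafka-2:7071', 'kafka-3:7071']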

Useful PromQL

# Total bytes in/out per second across the cluster
sum(rate(kafka_server_brokertopicmetrics_bytesin_total[5m]))
sum(rate(kafka_server_brokertopicmetrics_bytesout_total[5m]))

# Under-replicated partitions per broker
kafka_server_replicamanager_underreplicated_partitions

# ISR shrinkage rate (alert > 0 sustained)
sum(rate(kafka_server_replicamanager_isr_shrinks_total[5m])) by (instance)

# Active controller count (must be 1)
sum(kafka_controller_active_controllers)

# Consumer lag (requires kafka_exporter)
max(kafka_consumergroup_lag) by (consumergroup, topic)
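
The same metrics translate directly into alert rules; a sketch with conventional thresholds:

# kafka-alerts.yml
groups:
  - name: kafka-core
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicated_partitions) > 0
        for: 10m
        labels:
          severity: page
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_active_controllers) != 1
        for: 5m
        labels:
          severity: page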
