Operations¶
Scope
Production deployment patterns for Apache Kafka 4.x in KRaft mode, performance tuning (producer/broker/consumer), troubleshooting (under-replicated partitions, ISR shrinkage, broker bounces), tiered storage operations, and cost analysis.
Related Notes
messaging/kafka/index | messaging/kafka/architecture | messaging/kafka/security | messaging/index
Production Deployment Patterns¶
Cluster Topology (KRaft)¶
| Pattern | Controllers | Brokers | When to Use |
|---|---|---|---|
| Combined (`process.roles=controller,broker`) | 1–3 | same nodes | Dev / staging / very small prod (<5 brokers) |
| Isolated (`process.roles=controller` vs `broker`) | 3 dedicated | N dedicated | Production. Recommended for any cluster you'd page someone for. |
| Cross-AZ | 3 controllers spread across 3 AZs | brokers spread across 3 AZs, RF=3 with `min.insync.replicas=2` | Single-region HA |
| Cross-region (active-passive) | one cluster per region | MirrorMaker 2 replicating | DR with RPO measured in seconds |
| Cross-region (active-active) | one cluster per region | bidirectional MM2 with prefix-renamed topics | Lowest-RPO HA at the cost of dual writes / topic prefixing |
A 3-node controller quorum tolerates 1 controller failure; 5 nodes tolerate 2. Most production clusters run 3 controllers.
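For the isolated pattern, a dedicated controller carries only the metadata quorum. A minimal sketch of its properties file, assuming a static quorum and placeholder hostnames ctrl1–ctrl3:
# controller.properties: dedicated controller node (sketch; node IDs and hostnames are placeholders)
process.roles=controller
node.id=101
controller.quorum.voters=101@ctrl1:9093,102@ctrl2:9093,103@ctrl3:9093
listeners=CONTROLLER://:9093
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/metadata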
Broker Sizing¶
- CPU: 8–16 vCPU per broker is typical. ZSTD compression and TLS handshakes are the heaviest CPU consumers.
- RAM: page cache is everything. Plan for >=32 GiB (often 64–128 GiB) so that the active workload fits in cache. JVM heap should be capped (4–6 GiB is usually enough; do not let it crowd out the page cache).
- Disk: SSD/NVMe; XFS preferred over ext4. Separate log directory from OS disk. Multiple log directories on multiple disks distribute partition load.
- Network: 10 GbE minimum, 25 GbE+ for large clusters. Replication doubles or triples your inbound bytes.
- JVM: G1GC remains the default; ZGC has been adopted at large shops for sub-ms pauses.
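A sketch of how the heap and GC guidance above might translate into startup options. The environment variables are the standard hooks read by Kafka's launcher scripts; the heap size assumes a 64 GiB host, and the ZGC switch is optional and assumes a recent JDK:
# Cap the heap so the rest of the host's RAM stays available to the page cache
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
# Optional: low-pause collector; note this overrides the launcher's default G1 flags
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseZGC -XX:+AlwaysPreTouch"
bin/kafka-server-start.sh -daemon config/kraft/server.properties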
Topic Design¶
| Decision | Guidance |
|---|---|
| Partitions per topic | Pick by max consumer parallelism × headroom (e.g. 12, 24, 48). Hard to decrease later. |
| Replication factor | 3 for production. RF=2 gives no headroom for a node failure during a rolling upgrade. |
| `min.insync.replicas` | Set to RF − 1 (typically 2 for RF=3). Combined with `acks=all` this gives durable writes. |
| Cleanup policy | delete (time/size retention) for events; compact for state; compact,delete for hybrid (e.g. CDC). |
| Retention | Set per topic via retention.ms and/or retention.bytes. |
| Compression | Producer-side compression.type=zstd is usually best ratio; lz4 and snappy are lower CPU. |
| Tiered storage | For >7 days retention at >100 MB/s, enable remote.storage.enable=true per topic. |
Partition Planning¶
- Total partitions per cluster scales with brokers and hardware. A modest 6-broker KRaft cluster comfortably handles tens of thousands of partitions; large Confluent Cloud clusters serve hundreds of thousands.
- Per-broker partition count drives metadata cache size, follower fetch threads, and recovery time after restart. Watch the `kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount` metric.
- Increasing partitions later is cheap for new keys but breaks key-based ordering for existing keys.
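A rough throughput-based sizing to make the partition-count guidance concrete. All numbers are assumptions for illustration; measure per-partition throughput on your own hardware:
# Assumed: 300 MB/s target produce rate, ~25 MB/s sustained per partition on the producer side,
#          ~50 MB/s processed per consumer instance
# producer-driven:  300 / 25 = 12 partitions minimum
# consumer-driven:  300 / 50 = 6 consumers  -> at least 6 partitions
# take max(12, 6) = 12, then add headroom for growth -> 24 partitions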
Performance Tuning¶
Producer¶
| Setting | Default | Tuning Advice |
|---|---|---|
| `acks` | `all` (4.x default) | Keep `all` for durability. Set to `1` only for low-stakes telemetry. |
| `enable.idempotence` | `true` (4.x default) | Always on; avoids duplicates without sacrificing throughput. |
| `compression.type` | `none` | Use `zstd` for highest ratio, `lz4`/`snappy` for lower CPU. |
| `batch.size` | 16384 (16 KiB) | Increase to 64 KiB–1 MiB for throughput-oriented workloads. |
| `linger.ms` | 5 (4.x default) | Was 0 pre-4.0. Higher = larger batches = better throughput, worse latency. |
| `max.in.flight.requests.per.connection` | 5 | Keep at 5 (the max compatible with idempotence in 4.x; auto-fallback was removed). |
| `buffer.memory` | 32 MiB | Increase if you see record-send-rate drops or buffer-pool waits. |
Broker¶
| Setting | Default | Tuning Advice |
|---|---|---|
| `num.io.threads` | 8 | Match to disk parallelism; raise for many-disk hosts. |
| `num.network.threads` | 3 | Raise for high connection counts. |
| `num.replica.fetchers` | 1 | Raise to 4–8 for clusters with many followers per broker. |
| `socket.send.buffer.bytes` / `socket.receive.buffer.bytes` | 100 KiB / 100 KiB | Match to the bandwidth-delay product on high-bandwidth, high-latency links. |
| `log.flush.interval.messages` / `log.flush.interval.ms` | unlimited | Do not enable unless you have a specific durability mandate; rely on RF + `acks=all` instead. |
| `log.segment.bytes` | 1 GiB | Smaller segments = more retention granularity but more files. |
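As a server.properties fragment. The values are illustrative for a many-disk, high-connection broker; start from the defaults and change one setting at a time:
# server.properties: broker tuning sketch
num.io.threads=16
num.network.threads=8
num.replica.fetchers=4
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
# leave log.flush.interval.* unset; durability comes from RF=3 + acks=all
log.segment.bytes=1073741824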
Consumer¶
| Setting | Default | Tuning Advice |
|---|---|---|
| `fetch.min.bytes` | 1 | Raise to 1–10 KiB for batching efficiency. |
| `fetch.max.wait.ms` | 500 | Pair with `fetch.min.bytes` to amortize fetches. |
| `max.poll.records` | 500 | Lower for slow per-record processing to avoid `max.poll.interval.ms` timeouts. |
| `isolation.level` | `read_uncommitted` | Set to `read_committed` for transactional pipelines. |
| `group.protocol` | `classic` | Set to `consumer` to opt into the KIP-848 next-generation rebalance protocol. |
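And the consumer side as a consumer.properties sketch, usable with kafka-console-consumer.sh --consumer.config or kafka-consumer-perf-test.sh. Values assume a throughput-oriented, transactional pipeline rather than a latency-sensitive one:
# consumer.properties: batching-oriented sketch
fetch.min.bytes=10240
fetch.max.wait.ms=500
max.poll.records=500
isolation.level=read_committed
group.protocol=consumer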
Page Cache, Zero-Copy, Sendfile¶
- Kafka deliberately delegates caching to the Linux page cache rather than maintaining a JVM-side buffer. Do not pin the JVM heap so large that it crowds out the cache; 4–6 GiB heap on a 64 GiB host is typical.
- Tail reads (consumers near the head of the log) are served almost entirely from page cache.
- For lagging consumers, the broker uses `FileChannel.transferTo` (Linux `sendfile(2)`) to stream the on-disk batch straight to the consumer socket without copying through user space — provided the data is not encrypted at rest by an in-broker mechanism (TLS on the wire still allows zero-copy in certain JVMs and SSL configurations).
Troubleshooting¶
Under-Replicated Partitions¶
Symptom: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions > 0 for sustained periods.
Causes & responses:
- Slow follower fetch — check `kafka.server:type=FetcherStats,name=BytesPerSec`. Raise `num.replica.fetchers` or investigate disk/network on the slow follower.
- Broker down — check `kafka.controller:type=KafkaController,name=ActiveBrokerCount`. Restart the broker and let replication catch up.
- GC pauses — long GC on the leader stalls produce; on the follower it stalls fetch. Inspect `gc.log`. Switch from G1 to ZGC or increase heap.
- Disk I/O saturation — `iostat -xm 1`, look at `%util` and `await`. Move log directories to faster disks or split across more disks.
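A quick triage sequence tying those checks together. Broker address and the gc.log path are assumptions; adjust to your hosts and -Xlog:gc settings:
# Which partitions are currently under-replicated, and on which brokers?
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
# On the suspect follower: is the disk saturated?
iostat -xm 1 5
# Is a long GC pause the culprit? (log path depends on your JVM logging config)
grep -i "pause" /var/log/kafka/gc.log | tail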
ISR Shrinkage¶
Symptom: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec is non-zero.
Run kafka-topics.sh --describe --under-min-isr-partitions to identify affected partitions. With KIP-966 (Eligible Leader Replicas) on Kafka 4.0+, shrinkage below min.insync.replicas is less dangerous: replicas that held the full committed log remain tracked in the ELR, so the controller can still elect a leader without data loss instead of resorting to unclean election.
Broker Bounce Procedure¶
To restart a broker safely (its partitions will briefly be under-replicated, but clients see no unavailability because partition leadership moves off the broker first):
# 1. Drain producer/consumer traffic if behind a Kafka load balancer
# 2. Trigger a controlled shutdown so partition leaders move off this broker first
bin/kafka-server-stop.sh
# 3. Wait for leadership transfer (the controller logs "Moved leader of ..." messages)
# 4. Perform the maintenance (kernel patch, JVM upgrade, config change)
# 5. Restart
bin/kafka-server-start.sh -daemon config/server.properties
# 6. Trigger preferred-replica election once the broker has caught up
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
--election-type PREFERRED --all-topic-partitions
controlled.shutdown.enable=true (default) is what makes step 2 graceful.
Consumer Lag¶
bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe --group orders-aggregator
# Inspect LAG column. Sustained growth = consumer is too slow or partitioned poorly.
Mitigations: scale consumer count up to partition count; raise max.poll.records if processing is fast; lower it if processing is slow and you're hitting max.poll.interval.ms; investigate downstream backpressure (e.g. database write latency).
KRaft Controller Issues¶
# Inspect the metadata quorum
bin/kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --status
# Inspect quorum replication status (log end offsets and lag per voter/observer)
bin/kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --replication
If the active controller flaps, check controller-host CPU saturation, network partition, or disk latency — controllers fsync the metadata log on every commit.
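If the quorum looks healthy but metadata still seems stale, the __cluster_metadata log can be decoded directly on a controller host with kafka-dump-log.sh. The directory and segment filename below are examples; use the metadata log directory configured on your node:
# Decode KRaft metadata records from the local metadata log
bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/metadata/__cluster_metadata-0/00000000000000000000.log | head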
Cost Analysis¶
| Cost driver | Typical contributors | Reduction levers |
|---|---|---|
| Storage | Replication factor, retention, message size | Tiered storage (KIP-405) to S3, compaction, shorter retention.ms, zstd compression |
| Network egress | Cross-AZ replication, cross-region MM2, consumer fetches | Rack-aware producers (broker.rack), consumer rack awareness (KIP-392), Cluster Linking with offset preservation |
| Compute | Broker count, controller count, JVM CPU, TLS, compression | Right-size brokers, isolate controllers on smaller hosts, ZGC, hardware AES-NI |
| Managed surcharges | MSK / Confluent Cloud markups | Self-host on K8s with Strimzi; or use serverless tiers (MSK Serverless, Confluent Basic) for spiky workloads |
A representative AWS rule of thumb: cross-AZ traffic is by far the dominant variable cost for self-managed Kafka on EC2 — preferring intra-AZ produce/consume and using rack-aware fetching cuts the bill substantially.
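A back-of-the-envelope illustration with assumed numbers (AWS has historically listed cross-AZ transfer at roughly $0.01/GB in each direction, about $0.02/GB total; verify current pricing before budgeting):
# Assumed: 100 MB/s produce, RF=3 spread across 3 AZs, no rack-aware fetching
# Replication: 2 of 3 replicas live in other AZs -> ~200 MB/s crosses AZ boundaries
# 200 MB/s * 86,400 s/day * 30 days ≈ 518,400 GB/month
# 518,400 GB * $0.02/GB ≈ $10,400/month for replication traffic alone
# Non-rack-aware consumer fetches add roughly another 2/3 of read traffic on top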
Commands & Recipes¶
KRaft Cluster Bootstrap¶
# Generate a cluster ID
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
# Format the storage directory for the very first start
bin/kafka-storage.sh format \
--cluster-id $KAFKA_CLUSTER_ID \
--config config/kraft/server.properties
# Start the broker
bin/kafka-server-start.sh -daemon config/kraft/server.properties
# Verify cluster status
bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092
A minimal config/kraft/server.properties (combined controller+broker for dev):
process.roles=broker,controller
node.id=1
controller.quorum.bootstrap.servers=kafka1:9093,kafka2:9093,kafka3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
inter.broker.listener.name=PLAINTEXT
log.dirs=/var/lib/kafka/data
num.partitions=3
default.replication.factor=3
min.insync.replicas=2
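Note that this file uses the dynamic-quorum style (controller.quorum.bootstrap.servers, KIP-853). In that mode the very first controller is typically formatted with an extra flag, whereas static quorums declared via controller.quorum.voters format without it:
# Dynamic quorum only: bootstrap the first controller as a standalone voter
bin/kafka-storage.sh format --cluster-id $KAFKA_CLUSTER_ID \
  --config config/kraft/server.properties --standalone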
Topic Lifecycle (kafka-topics.sh)¶
# Create
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic orders \
--partitions 12 --replication-factor 3 \
--config retention.ms=604800000 \
--config min.insync.replicas=2 \
--config compression.type=zstd \
--config cleanup.policy=delete
# List
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
# Describe (shows leader, ISR, ELR per partition)
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders
# Alter (increase partitions — careful, breaks key ordering)
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--alter --topic orders --partitions 24
# Identify problem partitions
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions
# Delete
bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic orders
Configuration (kafka-configs.sh)¶
# Inspect topic-level config
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name orders --describe
# Tighten retention for an existing topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name orders \
--alter --add-config retention.ms=86400000
# Enable tiered storage on an existing topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name orders \
--alter --add-config remote.storage.enable=true,local.retention.ms=3600000
# Set client quotas (per user, per client-id)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type users --entity-name billing-svc \
--add-config 'producer_byte_rate=10485760,consumer_byte_rate=20971520'
# Dynamic broker config (no restart)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type brokers --entity-name 1 \
--alter --add-config log.cleaner.threads=4
Consumer Groups (kafka-consumer-groups.sh)¶
# List
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# Describe (LAG per partition)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group orders-aggregator
# Members + assignments
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group orders-aggregator --members --verbose
# Offset reset (DRY RUN first!)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group orders-aggregator --topic orders \
--reset-offsets --to-earliest --dry-run
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group orders-aggregator --topic orders \
--reset-offsets --to-earliest --execute
# Reset to a wall-clock time
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group orders-aggregator --topic orders \
--reset-offsets --to-datetime 2026-04-28T00:00:00.000 --execute
# Delete a group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--delete --group orders-aggregator
ACLs (kafka-acls.sh)¶
# Allow Bob to consume topic Test-topic in group Group-1
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
--add --allow-principal User:Bob --consumer \
--topic Test-topic --group Group-1
# Producer ACL with prefix pattern
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
--add --allow-principal User:Jane --producer \
--topic 'orders.' --resource-pattern-type prefixed
# List existing ACLs
bin/kafka-acls.sh --bootstrap-server localhost:9092 --list
kcat (formerly kafkacat)¶
# Produce a single message (key=k1, value=v1)
echo "v1" | kcat -b localhost:9092 -t orders -k k1 -P
# Consume from beginning, print key + offset, exit at end
kcat -b localhost:9092 -t orders -C -o beginning -e \
-f '%k\t%o\t%s\n'
# Consume as part of consumer group orders-aggregator (prints key, offset, value per record)
kcat -b localhost:9092 -G orders-aggregator orders \
-f '%k\t%o\t%s\n'
# Produce JSON to a topic from stdin (one record per line)
cat events.ndjson | kcat -b localhost:9092 -t events -P
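kcat's metadata mode is also handy for a quick cluster overview:
# List brokers, topics, partitions, leaders and replicas
kcat -b localhost:9092 -L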
Tiered Storage Setup¶
# server.properties (broker)
remote.log.storage.system.enable=true
remote.log.metadata.manager.listener.name=PLAINTEXT
remote.log.storage.manager.class.name=org.apache.kafka.server.log.remote.storage.LocalTieredStorage
remote.log.storage.manager.class.path=/opt/kafka/libs/tiered-storage-impl.jar
rsm.config.dir=/var/lib/kafka/remote-tier
rlmm.config.remote.log.metadata.topic.replication.factor=3
rlmm.config.remote.log.metadata.topic.min.isr=2
# Per-topic enablement
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic events-archive \
--partitions 12 --replication-factor 3 \
--config remote.storage.enable=true \
--config local.retention.ms=3600000 \
--config retention.ms=2592000000 \
--config segment.bytes=536870912
Strimzi Operator (Kubernetes)¶
# kafka-cluster.yaml (Strimzi v0.51+)
apiVersion: kafka.strimzi.io/v1
kind: Kafka
metadata:
name: orders-cluster
namespace: kafka
spec:
kafka:
version: 4.2.0
metadataVersion: 4.2-IV0
replicas: 6
listeners:
- name: tls
port: 9093
type: internal
tls: true
authentication:
type: tls
config:
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
default.replication.factor: 3
min.insync.replicas: 2
storage:
type: jbod
volumes:
- id: 0
type: persistent-claim
size: 500Gi
deleteClaim: false
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
entityOperator:
topicOperator: {}
userOperator: {}
# Install the Strimzi operator
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
--namespace kafka --create-namespace
# Apply the Kafka CR
kubectl apply -f kafka-cluster.yaml -n kafka
# Watch reconciliation
kubectl get kafka -n kafka -w
kubectl get kafkanodepool -n kafka
Prometheus JMX Exporter¶
# kafka-metrics-config.yml — minimal high-value rules
lowercaseOutputName: true
rules:
- pattern: "kafka.server<type=(.+), name=(.+)PerSec\\w*, topic=(.+)><>Count"
name: kafka_server_$1_$2_total
labels:
topic: "$3"
type: COUNTER
- pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
name: kafka_server_replicamanager_underreplicated_partitions
type: GAUGE
- pattern: "kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>Count"
name: kafka_server_replicamanager_isr_shrinks_total
type: COUNTER
- pattern: "kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value"
name: kafka_controller_active_controllers
type: GAUGE
- pattern: "kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>Count"
name: kafka_network_requests_total
labels:
request: "$1"
type: COUNTER
Run the exporter as a Java agent on each broker:
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-metrics-config.yml"
bin/kafka-server-start.sh config/kraft/server.properties
Strimzi handles this automatically when metricsConfig.type: jmxPrometheusExporter is set.
Useful PromQL¶
# Total bytes in/out per second across the cluster
sum(rate(kafka_server_brokertopicmetrics_bytesin_total[5m]))
sum(rate(kafka_server_brokertopicmetrics_bytesout_total[5m]))
# Under-replicated partitions per broker
kafka_server_replicamanager_underreplicated_partitions
# ISR shrinkage rate (alert > 0 sustained)
sum(rate(kafka_server_replicamanager_isr_shrinks_total[5m])) by (instance)
# Active controller count (must be 1)
sum(kafka_controller_active_controllers)
# Consumer lag (requires kafka_exporter)
max(kafka_consumergroup_lag) by (consumergroup, topic)
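These queries translate naturally into Prometheus alerting rules. A starting sketch; the durations and severities are assumptions to tune per cluster:
# prometheus-alerts.yml: alerting sketch based on the metrics above
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicated_partitions) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Under-replicated partitions for 10 minutes"
      - alert: KafkaNoSingleActiveController
        expr: sum(kafka_controller_active_controllers) != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cluster does not report exactly one active controller"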