Operations¶
Scope
Production deployment patterns for Apache Kafka 4.x in KRaft mode, performance tuning (producer/broker/consumer), troubleshooting (under-replicated partitions, ISR shrinkage, broker bounces), tiered storage operations, and cost analysis.
Related Notes
messaging/kafka/index | messaging/kafka/architecture | messaging/kafka/security | messaging/index
Production Deployment Patterns¶
Cluster Topology (KRaft)¶
| Pattern | Controllers | Brokers | When to Use |
|---|---|---|---|
| Combined (`process.roles=controller,broker`) | 1–3 | same nodes | Dev / staging / very small prod (<5 brokers) |
| Isolated (`process.roles=controller` vs `broker`) | 3 dedicated | N dedicated | Production. Recommended for any cluster you'd page someone for. |
| Cross-AZ | 3 controllers spread across 3 AZs | brokers spread across 3 AZs, RF=3 with `min.insync.replicas=2` | Single-region HA |
| Cross-region (active-passive) | one cluster per region | MirrorMaker 2 replicating | DR with RPO measured in seconds |
| Cross-region (active-active) | one cluster per region | bidirectional MM2 with prefix-renamed topics | Lowest-RPO HA at the cost of dual writes / topic prefixing |
A 3-node controller quorum tolerates 1 controller failure; 5 nodes tolerate 2. Most production clusters run 3 controllers.
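For the isolated pattern, a dedicated controller carries only the metadata quorum. A minimal sketch of its properties file, assuming a static quorum and placeholder hostnames ctrl1–ctrl3:
# controller.properties: dedicated controller node (sketch; node IDs and hostnames are placeholders)
process.roles=controller
node.id=101
controller.quorum.voters=101@ctrl1:9093,102@ctrl2:9093,103@ctrl3:9093
listeners=CONTROLLER://:9093
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/metadata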
Broker Sizing¶
- CPU: 8–16 vCPU per broker is typical. ZSTD compression and TLS handshakes are the heaviest CPU consumers.
- RAM: page cache is everything. Plan for >=32 GiB (often 64–128 GiB) so that the active workload fits in cache. JVM heap should be capped (4–6 GiB is usually enough; do not let it crowd out the page cache).
- Disk: SSD/NVMe; XFS preferred over ext4. Separate log directory from OS disk. Multiple log directories on multiple disks distribute partition load.
- Network: 10 GbE minimum, 25 GbE+ for large clusters. Replication doubles or triples your inbound bytes.
- JVM: G1GC remains the default; ZGC has been adopted at large shops for sub-ms pauses.
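A sketch of how the heap and GC guidance above might translate into startup options. The environment variables are the standard hooks read by Kafka's launcher scripts; the heap size assumes a 64 GiB host, and the ZGC switch is optional and assumes a recent JDK:
# Cap the heap so the rest of the host's RAM stays available to the page cache
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
# Optional: low-pause collector; note this overrides the launcher's default G1 flags
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseZGC -XX:+AlwaysPreTouch"
bin/kafka-server-start.sh -daemon config/kraft/server.properties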
Topic Design¶
| Decision | Guidance |
|---|---|
| Partitions per topic | Pick by max consumer parallelism × headroom (e.g. 12, 24, 48). Hard to decrease later. |
| Replication factor | 3 for production. RF=2 gives no headroom for a node failure during a rolling upgrade. |
| `min.insync.replicas` | Set to RF − 1 (typically 2 for RF=3). Combined with `acks=all` this gives durable writes. |
| Cleanup policy | delete (time/size retention) for events; compact for state; compact,delete for hybrid (e.g. CDC). |
| Retention | Set per topic via retention.ms and/or retention.bytes. |
| Compression | Producer-side compression.type=zstd is usually best ratio; lz4 and snappy are lower CPU. |
| Tiered storage | For >7 days retention at >100 MB/s, enable remote.storage.enable=true per topic. |
Partition Planning¶
- Total partitions per cluster scales with brokers and hardware. A modest 6-broker KRaft cluster comfortably handles tens of thousands of partitions; large Confluent Cloud clusters serve hundreds of thousands.
- Per-broker partition count drives metadata cache size, follower fetch threads, and recovery time after restart. Watch the `kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount` metric.
- Increasing partitions later is cheap for new keys but breaks key-based ordering for existing keys.
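A rough throughput-based sizing to make the partition-count guidance concrete. All numbers are assumptions for illustration; measure per-partition throughput on your own hardware:
# Assumed: 300 MB/s target produce rate, ~25 MB/s sustained per partition on the producer side,
#          ~50 MB/s processed per consumer instance
# producer-driven:  300 / 25 = 12 partitions minimum
# consumer-driven:  300 / 50 = 6 consumers  -> at least 6 partitions
# take max(12, 6) = 12, then add headroom for growth -> 24 partitions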
Performance Tuning¶
Producer¶
| Setting | Default | Tuning Advice |
|---|---|---|
| `acks` | `all` (4.x default) | Keep `all` for durability. Set to `1` only for low-stakes telemetry. |
| `enable.idempotence` | `true` (4.x default) | Always on; avoids duplicates without sacrificing throughput. |
| `compression.type` | `none` | Use `zstd` for highest ratio, `lz4`/`snappy` for lower CPU. |
| `batch.size` | 16384 (16 KiB) | Increase to 64 KiB–1 MiB for throughput-oriented workloads. |
| `linger.ms` | 5 (4.x default) | Was 0 pre-4.0. Higher = larger batches = better throughput, worse latency. |
| `max.in.flight.requests.per.connection` | 5 | Keep at 5 (the max compatible with idempotence in 4.x; auto-fallback was removed). |
| `buffer.memory` | 32 MiB | Increase if you see record-send-rate drops or buffer-pool waits. |
Broker¶
| Setting | Default | Tuning Advice |
|---|---|---|
| `num.io.threads` | 8 | Match to disk parallelism; raise for many-disk hosts. |
| `num.network.threads` | 3 | Raise for high connection counts. |
| `num.replica.fetchers` | 1 | Raise to 4–8 for clusters with many followers per broker. |
| `socket.send.buffer.bytes` / `socket.receive.buffer.bytes` | 100 KiB / 100 KiB | Match to the bandwidth-delay product on high-bandwidth, high-latency links. |
| `log.flush.interval.messages` / `log.flush.interval.ms` | unlimited | Do not enable unless you have a specific durability mandate; rely on RF + `acks=all` instead. |
| `log.segment.bytes` | 1 GiB | Smaller segments = more retention granularity but more files. |
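As a server.properties fragment. The values are illustrative for a many-disk, high-connection broker; start from the defaults and change one setting at a time:
# server.properties: broker tuning sketch
num.io.threads=16
num.network.threads=8
num.replica.fetchers=4
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
# leave log.flush.interval.* unset; durability comes from RF=3 + acks=all
log.segment.bytes=1073741824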
Consumer¶
| Setting | Default | Tuning Advice |
|---|---|---|
| `fetch.min.bytes` | 1 | Raise to 1–10 KiB for batching efficiency. |
| `fetch.max.wait.ms` | 500 | Pair with `fetch.min.bytes` to amortize fetches. |
| `max.poll.records` | 500 | Lower for slow per-record processing to avoid `max.poll.interval.ms` timeouts. |
| `isolation.level` | `read_uncommitted` | Set to `read_committed` for transactional pipelines. |
| `group.protocol` | `classic` | Set to `consumer` to opt into the KIP-848 next-generation rebalance protocol. |
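And the consumer side as a consumer.properties sketch, usable with kafka-console-consumer.sh --consumer.config or kafka-consumer-perf-test.sh. Values assume a throughput-oriented, transactional pipeline rather than a latency-sensitive one:
# consumer.properties: batching-oriented sketch
fetch.min.bytes=10240
fetch.max.wait.ms=500
max.poll.records=500
isolation.level=read_committed
group.protocol=consumer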
Page Cache, Zero-Copy, Sendfile¶
- Kafka deliberately delegates caching to the Linux page cache rather than maintaining a JVM-side buffer. Do not pin the JVM heap so large that it crowds out the cache; 4–6 GiB heap on a 64 GiB host is typical.
- Tail reads (consumers near the head of the log) are served almost entirely from page cache.
- For lagging consumers, the broker uses `FileChannel.transferTo` (Linux `sendfile(2)`) to stream the on-disk batch straight to the consumer socket without copying through user space — provided the data is not encrypted at rest by an in-broker mechanism (TLS on the wire still allows zero-copy in certain JVMs and SSL configurations).
Troubleshooting¶
Under-Replicated Partitions¶
Symptom: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions > 0 for sustained periods.
Causes & responses:
- Slow follower fetch — check `kafka.server:type=FetcherStats,name=BytesPerSec`. Raise `num.replica.fetchers` or investigate disk/network on the slow follower.
- Broker down — check `kafka.controller:type=KafkaController,name=ActiveBrokerCount`. Restart the broker and let replication catch up.
- GC pauses — long GC on the leader stalls produce; on the follower it stalls fetch. Inspect `gc.log`. Switch from G1 to ZGC or increase heap.
- Disk I/O saturation — `iostat -xm 1`, look at `%util` and `await`. Move log directories to faster disks or split across more disks.
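A quick triage sequence tying those checks together. Broker address and the gc.log path are assumptions; adjust to your hosts and -Xlog:gc settings:
# Which partitions are currently under-replicated, and on which brokers?
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
# On the suspect follower: is the disk saturated?
iostat -xm 1 5
# Is a long GC pause the culprit? (log path depends on your JVM logging config)
grep -i "pause" /var/log/kafka/gc.log | tail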
ISR Shrinkage¶
Symptom: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec is non-zero.
Run kafka-topics.sh --describe --under-min-isr-partitions to identify affected partitions. With KIP-966 (Eligible Leader Replicas) on Kafka 4.0+, shrinkage below min.insync.replicas is less dangerous: replicas that held the full committed log remain tracked in the ELR, so the controller can still elect a leader without data loss instead of resorting to unclean election.
Broker Bounce Procedure¶
To restart a broker safely (its partitions will briefly be under-replicated, but clients see no unavailability because partition leadership moves off the broker first):
# 1. Drain producer/consumer traffic if behind a Kafka load balancer
# 2. Trigger a controlled shutdown so partition leaders move off this broker first
bin/kafka-server-stop.sh
# 3. Wait for leadership transfer (the controller logs "Moved leader of ..." messages)
# 4. Perform the maintenance (kernel patch, JVM upgrade, config change)
# 5. Restart
bin/kafka-server-start.sh -daemon config/server.properties
# 6. Trigger preferred-replica election once the broker has caught up
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
--election-type PREFERRED --all-topic-partitions
controlled.shutdown.enable=true (default) is what makes step 2 graceful.
Consumer Lag¶
bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--describe --group orders-aggregator
# Inspect LAG column. Sustained growth = consumer is too slow or partitioned poorly.
Mitigations: scale consumer count up to partition count; raise max.poll.records if processing is fast; lower it if processing is slow and you're hitting max.poll.interval.ms; investigate downstream backpressure (e.g. database write latency).
KRaft Controller Issues¶
# Inspect the metadata quorum
bin/kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --status
# Inspect quorum replication status (log end offsets and lag per voter/observer)
bin/kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --replication
If the active controller flaps, check controller-host CPU saturation, network partition, or disk latency — controllers fsync the metadata log on every commit.
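If the quorum looks healthy but metadata still seems stale, the __cluster_metadata log can be decoded directly on a controller host with kafka-dump-log.sh. The directory and segment filename below are examples; use the metadata log directory configured on your node:
# Decode KRaft metadata records from the local metadata log
bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/metadata/__cluster_metadata-0/00000000000000000000.log | head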
Cost Analysis¶
| Cost driver | Typical contributors | Reduction levers |
|---|---|---|
| Storage | Replication factor, retention, message size | Tiered storage (KIP-405) to S3, compaction, shorter retention.ms, zstd compression |
| Network egress | Cross-AZ replication, cross-region MM2, consumer fetches | Rack-aware producers (broker.rack), consumer rack awareness (KIP-392), Cluster Linking with offset preservation |
| Compute | Broker count, controller count, JVM CPU, TLS, compression | Right-size brokers, isolate controllers on smaller hosts, ZGC, hardware AES-NI |
| Managed surcharges | MSK / Confluent Cloud markups | Self-host on K8s with Strimzi; or use serverless tiers (MSK Serverless, Confluent Basic) for spiky workloads |
A representative AWS rule of thumb: cross-AZ traffic is by far the dominant variable cost for self-managed Kafka on EC2 — preferring intra-AZ produce/consume and using rack-aware fetching cuts the bill substantially.
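A back-of-the-envelope illustration with assumed numbers (AWS has historically listed cross-AZ transfer at roughly $0.01/GB in each direction, about $0.02/GB total; verify current pricing before budgeting):
# Assumed: 100 MB/s produce, RF=3 spread across 3 AZs, no rack-aware fetching
# Replication: 2 of 3 replicas live in other AZs -> ~200 MB/s crosses AZ boundaries
# 200 MB/s * 86,400 s/day * 30 days ≈ 518,400 GB/month
# 518,400 GB * $0.02/GB ≈ $10,400/month for replication traffic alone
# Non-rack-aware consumer fetches add roughly another 2/3 of read traffic on top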
Commands & Recipes¶
KRaft Cluster Bootstrap¶
# Generate a cluster ID
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
# Format the storage directory for the very first start
bin/kafka-storage.sh format \
--cluster-id $KAFKA_CLUSTER_ID \
--config config/kraft/server.properties
# Start the broker
bin/kafka-server-start.sh -daemon config/kraft/server.properties
# Verify cluster status
bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092
A minimal config/kraft/server.properties (combined controller+broker for dev):
process.roles=broker,controller
node.id=1
controller.quorum.bootstrap.servers=kafka1:9093,kafka2:9093,kafka3:9093
controller.listener.names=CONTROLLER
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
inter.broker.listener.name=PLAINTEXT
log.dirs=/var/lib/kafka/data
num.partitions=3
default.replication.factor=3
min.insync.replicas=2
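Note that this file uses the dynamic-quorum style (controller.quorum.bootstrap.servers, KIP-853). In that mode the very first controller is typically formatted with an extra flag, whereas static quorums declared via controller.quorum.voters format without it:
# Dynamic quorum only: bootstrap the first controller as a standalone voter
bin/kafka-storage.sh format --cluster-id $KAFKA_CLUSTER_ID \
  --config config/kraft/server.properties --standalone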
Topic Lifecycle (kafka-topics.sh)¶
# Create
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic orders \
--partitions 12 --replication-factor 3 \
--config retention.ms=604800000 \
--config min.insync.replicas=2 \
--config compression.type=zstd \
--config cleanup.policy=delete
# List
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
# Describe (shows leader, ISR, ELR per partition)
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders
# Alter (increase partitions — careful, breaks key ordering)
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--alter --topic orders --partitions 24
# Identify problem partitions
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions
# Delete
bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic orders
Configuration (kafka-configs.sh)¶
# Inspect topic-level config
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name orders --describe
# Tighten retention for an existing topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name orders \
--alter --add-config retention.ms=86400000
# Enable tiered storage on an existing topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type topics --entity-name orders \
--alter --add-config remote.storage.enable=true,local.retention.ms=3600000
# Set client quotas (per user, per client-id)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--alter --entity-type users --entity-name billing-svc \
--add-config 'producer_byte_rate=10485760,consumer_byte_rate=20971520'
# Dynamic broker config (no restart)
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
--entity-type brokers --entity-name 1 \
--alter --add-config log.cleaner.threads=4
Consumer Groups (kafka-consumer-groups.sh)¶
# List
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# Describe (LAG per partition)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group orders-aggregator
# Members + assignments
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group orders-aggregator --members --verbose
# Offset reset (DRY RUN first!)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group orders-aggregator --topic orders \
--reset-offsets --to-earliest --dry-run
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group orders-aggregator --topic orders \
--reset-offsets --to-earliest --execute
# Reset to a wall-clock time
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group orders-aggregator --topic orders \
--reset-offsets --to-datetime 2026-04-28T00:00:00.000 --execute
# Delete a group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--delete --group orders-aggregator
ACLs (kafka-acls.sh)¶
# Allow Bob to consume topic Test-topic in group Group-1
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
--add --allow-principal User:Bob --consumer \
--topic Test-topic --group Group-1
# Producer ACL with prefix pattern
bin/kafka-acls.sh --bootstrap-server localhost:9092 \
--add --allow-principal User:Jane --producer \
--topic 'orders.' --resource-pattern-type prefixed
# List existing ACLs
bin/kafka-acls.sh --bootstrap-server localhost:9092 --list
kcat (formerly kafkacat)¶
# Produce a single message (key=k1, value=v1)
echo "v1" | kcat -b localhost:9092 -t orders -k k1 -P
# Consume from beginning, print key + offset, exit at end
kcat -b localhost:9092 -t orders -C -o beginning -e \
-f '%k\t%o\t%s\n'
# Consume as part of consumer group orders-aggregator (prints key, offset, value per record)
kcat -b localhost:9092 -G orders-aggregator orders \
-f '%k\t%o\t%s\n'
# Produce JSON to a topic from stdin (one record per line)
cat events.ndjson | kcat -b localhost:9092 -t events -P
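kcat's metadata mode is also handy for a quick cluster overview:
# List brokers, topics, partitions, leaders and replicas
kcat -b localhost:9092 -L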
Tiered Storage Setup¶
# server.properties (broker)
remote.log.storage.system.enable=true
remote.log.metadata.manager.listener.name=PLAINTEXT
remote.log.storage.manager.class.name=org.apache.kafka.server.log.remote.storage.LocalTieredStorage
remote.log.storage.manager.class.path=/opt/kafka/libs/tiered-storage-impl.jar
rsm.config.dir=/var/lib/kafka/remote-tier
rlmm.config.remote.log.metadata.topic.replication.factor=3
rlmm.config.remote.log.metadata.topic.min.isr=2
# Per-topic enablement
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic events-archive \
--partitions 12 --replication-factor 3 \
--config remote.storage.enable=true \
--config local.retention.ms=3600000 \
--config retention.ms=2592000000 \
--config segment.bytes=536870912
Strimzi Operator (Kubernetes)¶
# kafka-cluster.yaml (Strimzi v0.51+)
apiVersion: kafka.strimzi.io/v1
kind: Kafka
metadata:
name: orders-cluster
namespace: kafka
spec:
kafka:
version: 4.2.0
metadataVersion: 4.2-IV0
replicas: 6
listeners:
- name: tls
port: 9093
type: internal
tls: true
authentication:
type: tls
config:
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
default.replication.factor: 3
min.insync.replicas: 2
storage:
type: jbod
volumes:
- id: 0
type: persistent-claim
size: 500Gi
deleteClaim: false
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
entityOperator:
topicOperator: {}
userOperator: {}
# Install the Strimzi operator
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
--namespace kafka --create-namespace
# Apply the Kafka CR
kubectl apply -f kafka-cluster.yaml -n kafka
# Watch reconciliation
kubectl get kafka -n kafka -w
kubectl get kafkanodepool -n kafka
Prometheus JMX Exporter¶
# kafka-metrics-config.yml — minimal high-value rules
lowercaseOutputName: true
rules:
- pattern: "kafka.server<type=(.+), name=(.+)PerSec\\w*, topic=(.+)><>Count"
name: kafka_server_$1_$2_total
labels:
topic: "$3"
type: COUNTER
- pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
name: kafka_server_replicamanager_underreplicated_partitions
type: GAUGE
- pattern: "kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>Count"
name: kafka_server_replicamanager_isr_shrinks_total
type: COUNTER
- pattern: "kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value"
name: kafka_controller_active_controllers
type: GAUGE
- pattern: "kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+)><>Count"
name: kafka_network_requests_total
labels:
request: "$1"
type: COUNTER
Run the exporter as a Java agent on each broker:
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-metrics-config.yml"
bin/kafka-server-start.sh config/kraft/server.properties
Strimzi handles this automatically when metricsConfig.type: jmxPrometheusExporter is set.
Useful PromQL¶
# Total bytes in/out per second across the cluster
sum(rate(kafka_server_brokertopicmetrics_bytesin_total[5m]))
sum(rate(kafka_server_brokertopicmetrics_bytesout_total[5m]))
# Under-replicated partitions per broker
kafka_server_replicamanager_underreplicated_partitions
# ISR shrinkage rate (alert > 0 sustained)
sum(rate(kafka_server_replicamanager_isr_shrinks_total[5m])) by (instance)
# Active controller count (must be 1)
sum(kafka_controller_active_controllers)
# Consumer lag (requires kafka_exporter)
max(kafka_consumergroup_lag) by (consumergroup, topic)
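These queries translate naturally into Prometheus alerting rules. A starting sketch; the durations and severities are assumptions to tune per cluster:
# prometheus-alerts.yml: alerting sketch based on the metrics above
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicated_partitions) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Under-replicated partitions for 10 minutes"
      - alert: KafkaNoSingleActiveController
        expr: sum(kafka_controller_active_controllers) != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cluster does not report exactly one active controller"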