Apache Kafka

The de facto standard distributed event streaming platform — durable replicated logs, exactly-once semantics, KRaft consensus, and a deep ecosystem (Connect, Streams, ksqlDB, Schema Registry).

Overview

Apache Kafka is an open-source distributed event streaming platform originally developed at LinkedIn (2011) and donated to the Apache Software Foundation. It stores events in durable, partitioned, replicated commit logs organized into topics, with producers appending records and consumers reading at their own pace via durable per-group offsets. Since Kafka 4.0 (March 2025), the platform runs KRaft-only — Apache ZooKeeper has been fully removed and metadata is now managed by an internal Raft quorum of controllers.

Kafka's value proposition rests on four properties: (1) high throughput per broker thanks to sequential disk I/O, page-cache reliance, and zero-copy sendfile(), (2) durable replication via the in-sync-replica (ISR) protocol with optional Eligible Leader Replicas (ELR, KIP-966), (3) exactly-once semantics (EOS) for read-process-write workloads using idempotent producers + transactions, and (4) a massive ecosystem (Kafka Connect for integration, Kafka Streams for stateful processing, ksqlDB for SQL on streams, Schema Registry for data contracts, MirrorMaker 2 for cross-cluster replication, and tiered storage for cost-efficient long retention).
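
The log/offset model described above can be sketched in a few lines of Python. This is a toy in-memory model for illustration only (class and variable names are invented here, not Kafka API names): a topic partition is an append-only log, and each consumer group tracks its own read position independently.

```python
# Toy model of Kafka's storage abstraction: a partition is an
# append-only log; each consumer group keeps its own per-partition
# offset, so groups read at their own pace. Illustration only.

class TopicPartitionLog:
    def __init__(self):
        self.records = []            # append-only record list

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

class ConsumerGroup:
    def __init__(self):
        self.offsets = {}            # partition id -> next offset to read

    def poll(self, partition_id, log):
        pos = self.offsets.get(partition_id, 0)
        return log.records[pos:], pos

    def commit(self, partition_id, next_offset):
        self.offsets[partition_id] = next_offset

log = TopicPartitionLog()
for event in ["signup", "click", "purchase"]:
    log.append(event)

group_a, group_b = ConsumerGroup(), ConsumerGroup()
batch, pos = group_a.poll(0, log)        # group A reads all three events
group_a.commit(0, pos + len(batch))
log.append("refund")
batch2, _ = group_a.poll(0, log)         # group A sees only the new record
full, _ = group_b.poll(0, log)           # group B independently replays history
```

Note how `group_b` can still replay the full history after `group_a` has committed; this independence of per-group offsets is what makes replay and reprocessing first-class operations.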

Key Facts

| Attribute | Detail |
| --- | --- |
| Website | kafka.apache.org |
| GitHub Stars | ~30k (apache/kafka) |
| Latest Version | 4.2.0 (released 2026-02-17), 4.1.2 patch line |
| Language | Java + Scala (broker), Java/Go/Python/.NET/Rust clients |
| License | Apache License 2.0 (permissive) |
| Origin / Maintainer | Originated at LinkedIn (2011), Apache Software Foundation TLP since 2012 |
| Primary Vendor | Confluent (founded 2014 by original Kafka authors) |
| Coordination | KRaft (Raft); ZooKeeper removed in 4.0 |
| Wire Protocol | Binary TCP, schema versioned per ApiKey |
| Storage Format | Append-only segmented log, magic v2 record batches |
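
One concrete detail behind the "magic v2 record batches" row: per-record fields in the v2 format (offset deltas, timestamp deltas, key/value lengths) are encoded as zigzag varints, the same scheme Protocol Buffers uses. A minimal, dependency-free sketch of the encoding (illustrative, not the actual client implementation; it assumes 64-bit-range integers):

```python
def zigzag_encode(n: int) -> bytes:
    """Zigzag-map a signed int to unsigned, then varint-encode it."""
    z = (n << 1) ^ (n >> 63)      # arithmetic shift maps negatives correctly
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # set continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def zigzag_decode(data: bytes) -> int:
    """Reassemble 7-bit groups, then undo the zigzag mapping."""
    z, shift = 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        shift += 7
    return (z >> 1) ^ -(z & 1)
```

The payoff is that small deltas (the common case inside a batch) take one byte regardless of sign, which is a large part of why v2 batches are compact on disk and on the wire.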

Evaluation

Pros

| Pro | Detail |
| --- | --- |
| Massive ecosystem | Connect, Streams, ksqlDB, Schema Registry, hundreds of connectors |
| Durability + replication | ISR protocol, configurable min.insync.replicas, ELR (KIP-966) |
| Exactly-once semantics | Idempotent producer + transactions across topics/partitions |
| High throughput | LinkedIn measured 2M writes/sec on 3 commodity machines (2014) |
| Replayable history | Consumers re-read from any offset; durable for compliance and reprocessing |
| Tiered storage (KIP-405) | GA in 3.6; offload cold segments to S3/GCS/HDFS, cutting storage cost dramatically |
| Truly open | Apache 2.0, no BSL/SSPL relicensing risk like Redis or Elastic |
| Multi-language clients | Official Java client plus many high-quality community clients (librdkafka, sarama, confluent-kafka-python) |
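
The durability and exactly-once rows above map onto a small set of client and topic configs. A sketch of the knobs involved, using the standard librdkafka/Java property names (broker address and transactional ID values are illustrative):

```python
# Client-side configuration behind the durability and EOS rows above.
# Property names are standard Kafka client configs; values are examples.

producer_conf = {
    "bootstrap.servers": "broker1:9092",  # illustrative address
    "enable.idempotence": True,           # dedupe on retry (default since 3.0)
    "acks": "all",                        # wait for all in-sync replicas
    "transactional.id": "orders-app-1",   # enables transactions for EOS
}

# Topic/broker side: refuse writes unless enough replicas are in sync.
# With replication factor 3, min.insync.replicas=2 tolerates one broker down.
topic_conf = {"min.insync.replicas": 2}

# Consumers that must never observe aborted transactional records:
consumer_conf = {"isolation.level": "read_committed"}
```

The combination of `acks=all` plus `min.insync.replicas` is what turns the ISR protocol into a durability guarantee; either setting alone is not sufficient.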

Cons

| Con | Detail |
| --- | --- |
| Operational complexity | Topic design, partition planning, ISR shrinkage, broker bounce procedures all require expertise |
| JVM-bound brokers | Page cache and GC tuning matter; competing brokers (Redpanda) avoid the JVM entirely |
| Partition count limits | Per-broker partition count is a real ceiling (10k–200k depending on hardware/version) |
| Consumer rebalance pain | Before KIP-848 (GA in 4.0), rebalances stop the world; eager rebalancing is still common in older clients |
| Cross-region replication | MirrorMaker 2 is solid but not transparent; offset translation, lag, and cutover require care |
| Schema management not built-in | Confluent Schema Registry, Apicurio, or Karapace must be deployed separately |
| Cold-start latency | New consumers in a large group can wait seconds to minutes before processing |

Architecture (Summary)

```mermaid
flowchart LR
    subgraph Producers["Producers"]
        P1["KafkaProducer<br/>(idempotent)"]
        P2["KafkaProducer<br/>(transactional)"]
    end

    subgraph KafkaCluster["Kafka Cluster (KRaft)"]
        direction TB
        subgraph ControllerQuorum["Controller Quorum (Raft)"]
            KC1["KafkaController 1"]
            KC2["KafkaController 2 (active)"]
            KC3["KafkaController 3"]
        end
        subgraph Brokers["Broker Pool"]
            KS1["KafkaServer 1<br/>LogManager / ReplicaManager"]
            KS2["KafkaServer 2<br/>LogManager / ReplicaManager"]
            KS3["KafkaServer 3<br/>LogManager / ReplicaManager"]
        end
        ControllerQuorum -- "metadata log<br/>__cluster_metadata" --> Brokers
    end

    subgraph RemoteStorage["Tiered Storage (KIP-405)"]
        S3["RemoteStorageManager<br/>(S3 / GCS / HDFS)"]
    end

    subgraph Consumers["Consumer Groups"]
        CG1["ConsumerGroup A"]
        CG2["ConsumerGroup B<br/>(transactional read_committed)"]
    end

    P1 -- "Produce v9" --> KS1
    P2 -- "Produce v9 (txn)" --> KS2
    KS1 -- "Replicate (Fetch)" --> KS2
    KS2 -- "Replicate (Fetch)" --> KS3
    KS1 -- "Cold segment upload" --> S3
    CG1 -- "Fetch v15" --> KS2
    CG2 -- "Fetch v15" --> KS3

    style ControllerQuorum fill:#1f3a5f,color:#fff
    style KS2 fill:#0d6e0d,color:#fff
```

Detailed architecture, KRaft consensus internals, replication protocol, log format, and benchmarks live in messaging/kafka/architecture.
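
The replication arrows in the diagram hide one key invariant: the leader only advances the high watermark (the last offset exposed to consumers) to the minimum log-end offset replicated by every in-sync replica. A toy sketch of that calculation (function and broker names are illustrative, not Kafka internals):

```python
# Toy high-watermark calculation: consumers may only read offsets below
# the minimum log-end offset (LEO) held by every in-sync replica (ISR).

def high_watermark(leo_by_replica: dict, isr: set) -> int:
    """Offsets below the HW are committed and visible to consumers."""
    return min(leo_by_replica[r] for r in isr)

leos = {"broker1": 120, "broker2": 118, "broker3": 95}

# broker3 lags badly; while it remains in the ISR it holds back the HW:
hw_full_isr = high_watermark(leos, {"broker1", "broker2", "broker3"})

# once the leader shrinks the ISR (replica.lag.time.max.ms exceeded),
# the HW can advance to what the remaining in-sync replicas have:
hw_shrunk_isr = high_watermark(leos, {"broker1", "broker2"})
```

This is why ISR shrinkage matters operationally: a slow replica either delays consumers (while in the ISR) or reduces the effective redundancy (once evicted).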

Use Cases

| Use Case | Why Kafka Fits |
| --- | --- |
| Event sourcing | Durable, replayable, ordered per partition |
| Microservices messaging backbone | Decouples producers/consumers, supports fan-out and back-pressure |
| Real-time stream processing | Kafka Streams, Flink, Spark Structured Streaming, ksqlDB integrations |
| CDC (change data capture) | Debezium connectors stream Postgres/MySQL/Mongo bin-log events into topics |
| Log aggregation | Replaces Scribe/Flume with replicated storage and replay |
| Metrics & telemetry transport | OpenTelemetry exporters, Prometheus remote-write to Kafka |
| Data lake ingestion | Connect S3 sink, Iceberg sink, Delta Lake bridge |
| Audit trail / immutable event log | Compacted topics + retention policies |
| Activity stream / clickstream | Original LinkedIn use case; high-cardinality, partitioned by user/session |
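
"Partitioned by user/session" works because keyed records always land on the same partition, preserving per-key ordering. Kafka's Java client hashes keys with murmur2 for this; the sketch below substitutes CRC-32 purely to stay dependency-free, so the partition numbers differ from a real client but the invariant is the same:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka's default partitioner computes murmur2(key) & 0x7FFFFFFF
    # modulo the partition count; CRC-32 stands in here for
    # illustration. The invariant: same key -> same partition.
    return (zlib.crc32(key) & 0x7FFFFFFF) % num_partitions

sessions = [b"user-17", b"user-42", b"user-17", b"user-99", b"user-42"]
placement = [partition_for(s, 12) for s in sessions]
```

A practical consequence: all of `user-17`'s clicks serialize through one partition, so per-user ordering is free, but a single very hot key cannot be spread across partitions without changing the keying scheme.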

Licensing & Pricing

  • Apache Kafka: Apache License 2.0 — fully free, no usage restrictions, no telemetry call-home, vendor-neutral.
  • Confluent Platform: the Confluent Community License (source-available, free to use) covers some components (Schema Registry, REST Proxy historically); Confluent Enterprise (commercial) adds RBAC, audit logging, tiered storage UI, Cluster Linking, Control Center.
  • Confluent Cloud: Fully managed SaaS, priced per ingress GB / egress GB / partition-hour / cluster-type (Basic, Standard, Enterprise, Dedicated). Kora engine (Confluent's cloud-native rewrite) underlies Dedicated tier.
  • Managed alternatives: Amazon MSK (per-broker-hour + storage), MSK Serverless (per-partition-hour + GB), Aiven for Kafka, Instaclustr Managed Kafka, Azure Event Hubs Kafka API endpoint, Upstash Kafka.
  • Self-hosted on K8s: Strimzi (CNCF Sandbox, free) and the Bitnami Helm chart (note: Bitnami removed the public catalog free tier in July 2025).

License clarity

Unlike HashiCorp Vault (BSL 1.1), Redis (RSALv2/SSPLv1), Elastic (Elastic License v2), or MongoDB (SSPL), Apache Kafka has not been relicensed and remains pure Apache 2.0. Confluent's enterprise add-ons are separately licensed but the core broker is unaffected.

Ecosystem

| Component | Purpose |
| --- | --- |
| Kafka Connect | Source/sink integration framework: JDBC, S3, Elastic, Mongo, Snowflake, BigQuery, Iceberg |
| Kafka Streams | Embedded JVM library for stateful stream processing (KStream, KTable, joins, windowing) |
| ksqlDB | SQL-on-streams engine (Confluent Community License) |
| Schema Registry | Avro/JSON-Schema/Protobuf schema storage with compatibility checks (Confluent, Apicurio, Karapace) |
| MirrorMaker 2 | Connect-based cross-cluster replication (KIP-382, supersedes MM1) |
| Cruise Control | LinkedIn's automated rebalancing and self-healing controller |
| Strimzi | CNCF Sandbox K8s operator with full Kafka CRDs |
| Debezium | CDC connectors built on Kafka Connect |
| librdkafka | C/C++ client library (used by Python, Go, .NET wrappers) |
| kcat (kafkacat) | Swiss-army CLI for produce/consume/metadata |
| Conduktor / Kafka UI / AKHQ / Redpanda Console | Web UIs for cluster admin and topic browsing |
| Streams Replication Manager | Cloudera's commercial replication offering |

Compatibility & Requirements

  • JDK: Java 17 LTS or Java 21 LTS for brokers (broker support for Java 11 was dropped in 4.0).
  • Operating System: Linux strongly preferred for production (sendfile, page cache tuning, epoll); macOS for dev only; Windows broker not supported in production.
  • Hardware: SSD strongly recommended for the log directory; 10GbE+ networking; >=32 GiB RAM per broker for sizable workloads (page cache).
  • Filesystem: XFS recommended over ext4; avoid network-attached storage for active log dirs (tiered storage is the right answer for cold data).
  • Container support: Official apache/kafka:4.2.0 image (KRaft-native); also widely deployed via Strimzi / Confluent images.
  • Wire protocol compatibility: Newer brokers accept older client API versions; clients should match or be newer than the broker for best feature support. Mixed-version cluster upgrades are supported via the inter.broker.protocol.version setting (superseded by metadata.version feature levels in pure-KRaft 4.x).
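
Version negotiation in practice: on connect, clients issue an ApiVersions request, and then, per API key, both sides use the highest version in the overlap of their supported ranges. A simplified sketch of that intersection (function name and ranges are illustrative):

```python
def negotiate(client_range: tuple, broker_range: tuple):
    """Pick the highest API version both sides support, else None.

    Each range is (min_version, max_version) for one ApiKey, as
    a broker would report it in an ApiVersions response.
    """
    lo = max(client_range[0], broker_range[0])
    hi = min(client_range[1], broker_range[1])
    return hi if lo <= hi else None

# A modern broker advertises a wide window per ApiKey, so an older
# client simply lands on a lower-but-supported version:
old_client = negotiate((4, 12), (0, 15))    # old client, new broker
new_client = negotiate((0, 17), (0, 15))    # new client, old broker
no_overlap = negotiate((16, 17), (0, 15))   # client too new: no overlap
```

This is why "newer broker, older client" almost always works, while a client that has dropped old protocol versions can fail outright against a sufficiently old broker.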

Latest Versions

| Version | Release | Highlights |
| --- | --- | --- |
| 4.2.0 | 2026-02-17 | Share Groups (Queues, KIP-932) GA; Streams Rebalance Protocol GA; DLQ in exception handlers |
| 4.1.x | 2025-09 → 2026 | Share Groups preview; Streams Rebalance (KIP-1071) early access; KIP-877 plugin metrics |
| 4.0.0 | 2025-03-18 | KRaft-only (ZooKeeper removed); Consumer Rebalance Protocol (KIP-848) GA; Eligible Leader Replicas (KIP-966 part 1); Java 11 dropped |
| 3.9.x | 2024–2025 | Last ZK-supporting line; CVE-2025-27817/27818/27819 backports landed in 3.9.1 |
| 3.6.0 | 2023-10 | Tiered Storage (KIP-405) GA |
| 3.3.0 | 2022-10 | KRaft mode declared production-ready |

Alternatives

| Alternative | Style | When to Prefer |
| --- | --- | --- |
| Redpanda | Kafka API, C++ broker, no JVM, no ZK | Lower tail latency, simpler ops, single-binary deploy |
| Apache Pulsar | Tiered (compute / BookKeeper storage) | Built-in geo-replication, multi-tenant SaaS, native tiered storage |
| NATS / JetStream | Lightweight pub/sub + stream | Edge / IoT, ultra-low overhead, simpler operational model |
| RabbitMQ | AMQP broker | Per-message TTL, complex routing, traditional work-queues, priority |
| AWS Kinesis Data Streams | Managed shard-based stream | All-in on AWS; smaller scale; no Kafka API |
| Google Pub/Sub | Managed at-least-once pub/sub | All-in on GCP; serverless scaling; weaker ordering |
| Azure Event Hubs | Managed, Kafka-protocol-compatible | All-in on Azure; gateway speaks Kafka wire protocol |
| WarpStream | S3-native Kafka-compatible | Object-storage-only architecture; pay only for storage |
| AutoMQ | S3-native Kafka fork | Cloud-native cost optimization, Kafka API |

See Streaming Brokers Comparison for a head-to-head.

Migration & Lock-in

  • API surface: The Kafka wire protocol is open; Redpanda, WarpStream, AutoMQ, Azure Event Hubs, and others reimplement it. Migrating Kafka clients to a wire-compatible alternative usually requires only a bootstrap.servers change.
  • Cross-cluster migration: MirrorMaker 2 (Connect-based) is the canonical tool — it replicates topic data, consumer offsets (via the Checkpoint connector), heartbeats, and ACLs. Confluent Cluster Linking is a more transparent commercial alternative that preserves offsets exactly.
  • Connector ecosystem lock-in: Some Confluent-licensed connectors (e.g. some premium cloud sinks) won't run outside Confluent Platform; standard open-source connectors (Debezium, jdbc, S3) port freely.
  • Operational lock-in: Tiered storage configuration is portable but the active RemoteLogMetadataManager topic isn't trivially moved — plan for a clean cutover.
  • Schema lock-in: Confluent Schema Registry's wire format includes a 5-byte magic + schema-ID prefix; Apicurio and Karapace implement the same wire format for compatibility.
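
The Schema Registry framing mentioned in the last bullet is simple enough to show directly: one magic byte (0) plus a 4-byte big-endian schema ID, followed by the serialized payload. A sketch (payload bytes here are arbitrary placeholders):

```python
import struct

MAGIC = 0  # Confluent wire-format magic byte

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized record with the 5-byte registry header."""
    return struct.pack(">bI", MAGIC, schema_id) + payload

def unframe(message: bytes):
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC:
        raise ValueError("not Confluent wire format")
    return schema_id, message[5:]

msg = frame(42, b"\x02hi")       # payload bytes are illustrative
sid, body = unframe(msg)
```

Because Apicurio and Karapace emit the same 5-byte header, consumers deserializing with any of the three registries can read each other's records, which is exactly why this layer is one of the weaker lock-in points.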

Community Health

  • Apache TLP since October 2012; one of the most active TLPs by commit volume.
  • KIP (Kafka Improvement Proposal) process — public design docs voted on by PMC. Recent flagship KIPs include KIP-848 (consumer rebalance), KIP-932 (share groups / queues), KIP-405 (tiered storage), KIP-714 (client telemetry), KIP-966 (ELR), and KIP-1071 (Streams rebalance).
  • Quarterly minor releases; ~12 months of community support per minor.
  • Confluent employs many committers but the project remains independently governed by the Apache Software Foundation.
  • Strong third-party ecosystem: Strimzi (CNCF Sandbox), Debezium (Red Hat), Cruise Control (LinkedIn), Aiven, Instaclustr, AWS MSK, Azure Event Hubs.

Open Questions

  • Q: Is Share Groups (KIP-932) ready to replace RabbitMQ for queue-style workloads? — GA in 4.2 (Feb 2026), but message-level acknowledgements and DLQ semantics are new; mature shops should evaluate on staging before betting production traffic.
  • Q: When does tiered storage actually pay off vs. just over-provisioning local disk? — Generally once active retention exceeds ~7 days at >100 MB/s sustained write, and especially when egress patterns are mostly tail reads (page-cache hits) with occasional historical replays.
  • Q: How do KRaft controller failovers compare operationally to ZK failovers in real production? — Anecdotally faster (sub-second metadata propagation in many cases) and simpler to operate, but few public head-to-head latency studies exist for very large clusters (>200 brokers).
  • Q: What is the practical maximum partitions-per-broker on KRaft 4.x? — Pre-KRaft, the rule of thumb was ~4k partitions per broker before metadata propagation became painful. KRaft and ELR raise this substantially; published numbers from Confluent/AWS suggest 200k+ per cluster, but per-broker budgets depend heavily on hardware and replication factor.
  • Q: Is exactly-once semantics still expensive enough to avoid by default? — In Kafka 4.x with idempotent producer enabled by default, the marginal cost of EOS is much smaller than in 0.11–2.x; for read-process-write Streams jobs it is usually the right default.
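
The break-even intuition in the tiered-storage question above is easy to put in numbers (back-of-envelope figures, not a benchmark):

```python
# Back-of-envelope: local disk needed to hold a retention window at a
# sustained write rate, multiplied by the replication factor.

def local_disk_tb(write_mb_per_s: float, retention_days: float,
                  replication_factor: int = 3) -> float:
    mb = write_mb_per_s * 86_400 * retention_days * replication_factor
    return mb / 1_000_000  # decimal TB

# 100 MB/s for 7 days at RF=3 requires roughly 181 TB of hot storage
# across the cluster, before any headroom for compaction or rebalances.
needed = local_disk_tb(100, 7)
```

At that scale, offloading cold segments to object storage (billed per GB-month, un-replicated by you) usually dominates the cost of provisioning replicated SSD, which is where the "~7 days at >100 MB/s" rule of thumb comes from.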