Apache Pulsar
Cloud-native messaging and streaming with separated compute and storage, multi-tenancy from day one, and built-in geo-replication.
Overview
Apache Pulsar is a distributed pub-sub and streaming platform with a fundamentally different architecture from Kafka: brokers are stateless while storage lives in Apache BookKeeper bookies. This separation lets compute and storage scale independently, and lets the same broker fleet serve hot, cold, and tiered-to-S3 reads transparently.
Pulsar was created at Yahoo, open-sourced in 2016, and donated to the Apache Software Foundation, where it became a top-level project in 2018. It is now developed by a wide vendor base, including StreamNative, DataStax, Tencent, and Alibaba.
Key Facts
| Attribute | Detail |
|---|---|
| Website | pulsar.apache.org |
| GitHub | apache/pulsar |
| Stars | ~14k+ ⭐ |
| Latest Version | 4.1.x (CY 2026 stable line) |
| Language | Java |
| License | Apache-2.0 |
| Steward | Apache Software Foundation; commercial support from StreamNative, DataStax |
| Wire Protocols | Pulsar protocol (native), Kafka API (KoP plugin), MQTT (MoP), AMQP (AoP) |
Evaluation
| Pros | Cons |
|---|---|
| Compute/storage separation scales independently | More moving parts: brokers + BookKeeper + metadata store |
| Multi-tenancy is native (tenants → namespaces → topics) | Operationally heavier than NATS or Redpanda |
| Geo-replication is a config flag (per-namespace) | Geo-replication ordering caveats need careful reading |
| Tiered storage to S3 / GCS / Azure Blob | Java/JVM tax (heap tuning, GC) |
| Pulsar Functions = stream processing in-broker | Pulsar Functions runtime needs its own ops attention |
| Schema Registry is built-in | KoP/MoP/AoP feature parity lags native |
| Subscription types: Exclusive / Failover / Shared / Key_Shared | Some Kafka tooling expects ZooKeeper / KRaft directly |
| Strong consistency via BookKeeper write quorums | Cursor management can be tricky on partitioned topics |
Architecture
```mermaid
flowchart TB
    Producer["Producer / Consumer client"]
    subgraph BrokerLayer["Pulsar Brokers (stateless)"]
        Broker1["Pulsar Broker 1"]
        Broker2["Pulsar Broker 2"]
        Broker3["Pulsar Broker 3"]
    end
    subgraph BookKeeper["Apache BookKeeper (storage)"]
        Bookie1["BookKeeper Bookie 1"]
        Bookie2["BookKeeper Bookie 2"]
        Bookie3["BookKeeper Bookie 3"]
        Bookie4["BookKeeper Bookie 4"]
    end
    subgraph Metadata["Metadata layer"]
        ZK["ZooKeeper / etcd / RocksDB"]
        ConfigStore["Configuration Store (global)"]
    end
    subgraph TieredStorage["Tiered storage"]
        S3["S3 / GCS / Azure Blob"]
    end
    Producer --> Broker1
    Producer --> Broker2
    Producer --> Broker3
    Broker1 --> Bookie1
    Broker1 --> Bookie2
    Broker1 --> Bookie3
    Broker2 --> Bookie2
    Broker2 --> Bookie3
    Broker2 --> Bookie4
    Broker3 --> Bookie1
    Broker3 --> Bookie3
    Broker3 --> Bookie4
    Broker1 -.-> ZK
    Broker2 -.-> ZK
    Broker3 -.-> ZK
    ZK -.-> ConfigStore
    Bookie1 -.-> ZK
    Bookie2 -.-> ZK
    Bookie3 -.-> ZK
    Bookie4 -.-> ZK
    Broker1 --> S3
    Broker2 --> S3
```
See messaging/pulsar/architecture for component-level details.
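The diagram shows each broker writing to three of the four bookies. A minimal sketch of how that can come about, assuming BookKeeper's default round-robin striping over an ensemble and ignoring rack-aware placement, fencing, and ensemble changes after a bookie failure. The function names and bookie labels are illustrative, not part of any Pulsar or BookKeeper API.

```python
from typing import List


def write_set(entry_id: int, ensemble: List[str], write_quorum: int) -> List[str]:
    """Return the bookies that store one entry under round-robin striping."""
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write quorum must be between 1 and the ensemble size")
    start = entry_id % len(ensemble)
    # Each entry goes to write_quorum consecutive bookies, wrapping around.
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


def is_acknowledged(acks: int, ack_quorum: int) -> bool:
    """An entry is confirmed to the producer once ack_quorum bookies persisted it."""
    return acks >= ack_quorum


if __name__ == "__main__":
    ensemble = ["bookie1", "bookie2", "bookie3", "bookie4"]  # ensemble size 4
    for entry_id in range(4):
        print(entry_id, write_set(entry_id, ensemble, write_quorum=3))
```

With an ensemble of four and a write quorum of three, consecutive entries rotate through the bookies, which is why no single bookie holds every entry of a ledger and why read load spreads across the storage layer.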
Use Cases
- Multi-tenant SaaS messaging — tenants → namespaces → topics with per-tenant resource quotas.
- Geo-distributed streaming — built-in cross-region replication, no external MirrorMaker-equivalent.
- Hybrid cloud / hybrid edge — brokers near applications, bookies near storage.
- Independent compute/storage scaling — high-fanout consumers don't strain storage; large retention doesn't strain brokers.
- Stream processing inside the cluster — Pulsar Functions for ETL, enrichment, simple transformations.
- IoT message ingest — combine native Pulsar with MQTT-on-Pulsar (MoP) for device traffic.
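The tenant → namespace → topic hierarchy is visible directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The following sketch parses and rebuilds such names; the `TopicName` type, the `parse_topic` helper, and the `acme`/`billing` names are hypothetical and only illustrate the naming scheme.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str      # isolation and quota boundary
    namespace: str   # policy boundary (retention, replication, ...)
    topic: str       # local topic name

    def __str__(self) -> str:
        return f"{self.domain}://{self.tenant}/{self.namespace}/{self.topic}"


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into its four components."""
    domain, sep, rest = name.partition("://")
    parts = rest.split("/", 2)
    valid_domain = domain in ("persistent", "non-persistent")
    if not sep or not valid_domain or len(parts) != 3 or not all(parts):
        raise ValueError(f"not a fully qualified topic name: {name!r}")
    return TopicName(domain, *parts)


if __name__ == "__main__":
    name = parse_topic("persistent://acme/billing/invoices")
    print(name.tenant, name.namespace, name.topic)
```

Because policies such as quotas and replication attach at the tenant and namespace levels, keeping these components explicit in application code makes it easier to reason about which policy applies to a given topic.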
Licensing & Pricing
- Apache Pulsar: Apache-2.0, free for any use.
- StreamNative Cloud: managed cloud Pulsar (BYOC, Dedicated, Serverless tiers).
- DataStax Astra Streaming: managed Pulsar (DataStax was acquired by IBM in 2025).
- Tencent Cloud TDMQ: managed Pulsar in Tencent Cloud.
Ecosystem
- Pulsar Functions — lightweight stream-processing runtime in-broker or as separate workers.
- Pulsar IO connectors — source/sink connectors for Kafka, JDBC, S3, Elasticsearch, MongoDB, etc.
- Pulsar SQL — Trino-based SQL over topic data.
- Schema Registry — built-in; supports Avro, JSON, Protobuf, and key-value composites.
- KoP (Kafka-on-Pulsar) — Pulsar broker speaks Kafka wire protocol.
- MoP (MQTT-on-Pulsar) — accept MQTT clients on the broker.
- AoP (AMQP-on-Pulsar) — accept AMQP 0-9-1 clients on the broker.
- Clients — Java (reference), Go, Python, Node.js, C++, C#, Rust.
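To give a feel for how small a Pulsar Function can be, here is a sketch in the language-native Python style, where a function is a plain `process` callable that receives one message and returns the value to publish. It assumes string payloads carrying JSON; the `temp_c` and `temp_f` fields are invented for illustration.

```python
import json


def process(input):
    """Enrich one JSON event with a derived field and return it as a string."""
    event = json.loads(input)
    # Hypothetical enrichment: add Fahrenheit next to a Celsius reading.
    event["temp_f"] = event["temp_c"] * 9 / 5 + 32
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(process('{"device": "d1", "temp_c": 21.5}'))
```

A function like this has no dependency on the Pulsar client library, so it can be unit-tested in isolation before being deployed to the Functions runtime. Functions that need state, logging, or the message metadata would use the SDK's context object instead.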
Compatibility & Requirements
| Requirement | Detail |
|---|---|
| Brokers | Java 17+; 4 vCPU + 8 GB heap baseline |
| Bookies | Java 17+; SSD/NVMe for journal + ledger dirs |
| Metadata store | ZooKeeper 3.8+, etcd 3.5+, or RocksDB (standalone) |
| Configuration store | Per-instance global ZooKeeper for multi-cluster |
| Tiered storage | S3, GCS, Azure Blob |
| Network | TCP 6650 (binary), 8080 (HTTP REST), 6651 (TLS), 8443 (HTTPS) |
| Container | Official Apache images at apachepulsar/pulsar |
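The ports in the table map onto the URLs that clients and admin tools connect to: `pulsar://` on 6650 and `pulsar+ssl://` on 6651 for the binary protocol, and HTTP or HTTPS on 8080 and 8443 for the REST API. A small sketch, assuming default ports; the `service_urls` helper and the host name are hypothetical.

```python
from typing import Dict


def service_urls(host: str, tls: bool = False) -> Dict[str, str]:
    """Build broker and admin URLs for a host using Pulsar's default ports."""
    if tls:
        return {
            "broker": f"pulsar+ssl://{host}:6651",  # binary protocol over TLS
            "admin": f"https://{host}:8443",        # REST API over HTTPS
        }
    return {
        "broker": f"pulsar://{host}:6650",          # binary protocol, plaintext
        "admin": f"http://{host}:8080",             # REST API, plaintext
    }


if __name__ == "__main__":
    print(service_urls("pulsar.example.internal"))
    print(service_urls("pulsar.example.internal", tls=True))
```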
Latest Versions
- 4.1.x — current 2026 stable: improved load balancer, lazier topic loading, transactional consumer fixes.
- 4.0.x — added per-broker pluggable load balancer, cursor protobuf optimizations.
- 3.x — long-term-supported branch (3.0 was the first LTS release); ZooKeeper is the default metadata store.
- 2.10.x — older; missing many recent multi-tenant features.
Track at pulsar.apache.org/release-notes.
Alternatives
- Apache Kafka — single-tier compute+storage; simpler model but coupled scaling.
- Redpanda — Kafka API in C++; per-partition Raft.
- NATS — lighter, lower-latency request-reply; account isolation rather than tenants/namespaces.
- RabbitMQ — AMQP routing primitives; not log-replay-oriented.
- AWS Kinesis / Azure Event Hubs / GCP Pub/Sub — managed cloud-native equivalents.
Migration & Lock-in
- Kafka clients work via KoP plugin — but feature parity isn't 1:1.
- Tiered storage offload format is portable — segments are stored as a known format (Apache MLP).
- Pulsar Functions are Pulsar-specific; rewriting against Flink or Beam is non-trivial.
- Schema Registry is wire-compatible with Confluent SR for Avro use cases (read carefully — Pulsar's KV schemas are unique).
- Subscription type lock-in: `Key_Shared` semantics differ from any other broker; design carefully.
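To make the `Key_Shared` point concrete: the broker hashes each message key into a fixed slot range and routes it to the consumer that currently owns that slot, so per-key ordering holds only while ownership is stable. The sketch below is a deliberately simplified model, not the broker's algorithm: it uses CRC32 as a stand-in hash because it is in the standard library, and it splits the range into equal-width slices per consumer. All names are hypothetical.

```python
import zlib
from typing import List

HASH_RANGE = 65536  # slots 0..65535


def slot_for(key: str) -> int:
    """Map a message key to a slot (stand-in hash, chosen for portability)."""
    return zlib.crc32(key.encode("utf-8")) % HASH_RANGE


def consumer_for(key: str, consumers: List[str]) -> str:
    """Pick the consumer whose equal-width slot range contains the key's slot."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    width = HASH_RANGE / len(consumers)
    index = min(int(slot_for(key) / width), len(consumers) - 1)
    return consumers[index]


if __name__ == "__main__":
    keys = [f"order-{i}" for i in range(1000)]
    before = {k: consumer_for(k, ["c1", "c2"]) for k in keys}
    after = {k: consumer_for(k, ["c1", "c2", "c3"]) for k in keys}
    moved = sum(1 for k in keys if before[k] != after[k])
    print(f"{moved} of {len(keys)} keys changed consumer after scale-out")
```

The point of the demo is the last line: adding a consumer reassigns part of the slot range, so some keys switch owners. Applications that rely on key locality (for example, in-memory per-key state) need a plan for those transitions.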
Community Health
- Active multi-vendor governance (StreamNative, Tencent, Alibaba, DataStax, Yahoo Japan, …).
- Regular Apache release cadence; LTS lines.
- Annual Pulsar Summit and active mailing lists.
- Roughly 14k GitHub stars on `apache/pulsar` and many production references at scale (Yahoo Japan, Splunk, Tencent, Verizon).
Sources
- Apache Pulsar Documentation (queried via Context7 `/websites/pulsar_apache_4_1_x`).
- pulsar-architecture overview.
- BookKeeper concepts.
- StreamNative blog.
- OpenMessaging Benchmark.
Open Questions
- For multi-region active-active, is Pulsar's geo-replication operationally simpler than Kafka MirrorMaker 2 or Confluent Cluster Linking, and at what RPO?
- What is the practical storage-cluster (BookKeeper) sizing rule for sustained 1 GB/s ingest with 30-day retention?
- For `Key_Shared` subscriptions, what failure modes can violate key-locality (e.g. bookie loss + cursor recovery)?
- For tiered storage, what is the time-to-first-byte for a consumer reading offloaded data vs hot storage?
- After IBM's DataStax acquisition, what is the long-term commercial roadmap for Astra Streaming?
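The BookKeeper sizing question has at least a simple lower bound that can be computed before any benchmarking. The sketch below multiplies ingest rate by retention and by the number of replicas. It assumes decimal units (1 GB = 10^9 bytes) and a write quorum of 3, which is an assumption rather than a figure from this page, and it ignores journal space, compression, compaction headroom, and tiered-storage offload.

```python
def retained_bytes(ingest_bytes_per_s: int, retention_days: int, write_quorum: int) -> int:
    """Lower bound on bookie ledger storage: ingest x retention x replicas."""
    seconds = retention_days * 24 * 60 * 60
    return ingest_bytes_per_s * seconds * write_quorum


if __name__ == "__main__":
    GB = 10**9
    PB = 10**15
    logical = retained_bytes(1 * GB, 30, write_quorum=1)
    replicated = retained_bytes(1 * GB, 30, write_quorum=3)
    print(f"logical: {logical / PB:.3f} PB, replicated: {replicated / PB:.3f} PB")
```

For 1 GB/s over 30 days this gives about 2.6 PB of logical data and about 7.8 PB once three copies are stored, which is the scale at which offloading older segments to object storage starts to dominate the cost discussion.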