
Architecture

Apache Pulsar's three-tier architecture: stateless brokers, BookKeeper bookies for storage, and ZooKeeper (or alternatives) for metadata.

Component Overview

flowchart TB
    Client["Producer / Consumer"]
    subgraph Brokers["Pulsar Brokers (stateless)"]
        Broker["PulsarBroker\n- ManagedLedger\n- ManagedCursor\n- NamespaceBundle owner\n- Subscriptions\n- Functions runtime"]
    end
    subgraph BK["Apache BookKeeper"]
        Bookie["BookKeeperBookie\n- Journal disk\n- Ledger disk\n- Garbage collection"]
        AutoRecovery["BK AutoRecovery\n(Auditor + Replication Worker)"]
    end
    subgraph Metadata["Metadata Layer"]
        ZK["ZooKeeper / etcd / RocksDB"]
        ConfigStore["Configuration Store\n(per-instance, global)"]
    end
    subgraph Tiered["Tiered Storage"]
        Offloader["TieredStorageOffloader"]
        Cloud["S3 / GCS / Azure Blob"]
    end
    Client --> Broker
    Broker --> Bookie
    Broker --> ZK
    Bookie --> ZK
    AutoRecovery --> ZK
    AutoRecovery --> Bookie
    Broker --> Offloader
    Offloader --> Cloud
    Broker -.cluster info.- ConfigStore

Layer responsibilities

| Layer | Responsibilities | Holds state? |
| --- | --- | --- |
| Broker | Topic ownership (via NamespaceBundles), subscriptions, ManagedLedger / ManagedCursor handles, Pulsar Functions instances | Stateless w.r.t. message data |
| BookKeeper bookies | Persistent message data as ledgers (sequences of entries) | Stateful |
| ZooKeeper / etcd | Cluster topology, ledger metadata, broker–bundle ownership, schemas | Stateful (small) |
| Configuration Store | Tenants, namespaces, policies, geo-replication topology | Stateful (small) |
| Tiered storage | Offloaded ledgers | Stateful (large) |
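
Clients never talk to bookies or the metadata layer directly; they only see broker URLs. A minimal Java sketch of the two client entry points (host names and ports are placeholders for your deployment):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.PulsarClient;

public class ConnectSketch {
    public static void main(String[] args) throws Exception {
        // Data path: producers and consumers speak the binary protocol to a broker.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://broker.example.com:6650")   // placeholder host
                .build();

        // Control path: admin operations go to a broker's HTTP endpoint.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://broker.example.com:8080") // placeholder host
                .build();

        admin.close();
        client.close();
    }
}
```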

Tenant / Namespace / Topic Hierarchy

my-tenant/
   ├── ns-prod/
   │     ├── persistent://my-tenant/ns-prod/orders   (partitioned 12 ways)
   │     └── persistent://my-tenant/ns-prod/audit-log
   └── ns-staging/
         └── non-persistent://my-tenant/ns-staging/debug
  • Tenant — security boundary; managed by an admin.
  • Namespace — policy boundary (retention, replication, schema, dispatch quotas).
  • Topic — actual message stream. persistent://... (durable) or non-persistent://... (best-effort).
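
The same hierarchy can be created programmatically. A hedged Java admin sketch (cluster name, admin role, and admin endpoint are assumptions, and TenantInfo.builder() assumes a recent client version):

```java
import java.util.Set;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TenantInfo;

public class HierarchySketch {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")        // assumed admin endpoint
                .build()) {
            // Tenant: security boundary, restricted to specific clusters and admin roles.
            admin.tenants().createTenant("my-tenant",
                    TenantInfo.builder()
                            .adminRoles(Set.of("ops-team"))     // assumed role
                            .allowedClusters(Set.of("us-east")) // assumed cluster name
                            .build());

            // Namespaces: policy boundaries under the tenant.
            admin.namespaces().createNamespace("my-tenant/ns-prod");
            admin.namespaces().createNamespace("my-tenant/ns-staging");

            // Topic: the actual stream, partitioned 12 ways as in the tree above.
            admin.topics().createPartitionedTopic(
                    "persistent://my-tenant/ns-prod/orders", 12);
        }
    }
}
```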

Topic Storage Model — Ledgers and Segments

A topic's data is stored as a chain of ledgers (BookKeeper entities, often called segments). Each ledger is a sequence of entries striped across a set of bookies; the broker's ManagedLedger abstraction handles ledger creation, rollover, and trimming.

flowchart LR
    subgraph Topic["persistent://my-tenant/ns/orders"]
        L1["Ledger 1\n(closed)"]
        L2["Ledger 2\n(closed)"]
        L3["Ledger 3\n(open)"]
    end
    subgraph Bookies["BookKeeper bookies"]
        BkA["Bookie A"]
        BkB["Bookie B"]
        BkC["Bookie C"]
    end
    L1 --> BkA
    L1 --> BkB
    L1 --> BkC
    L2 --> BkA
    L2 --> BkB
    L2 --> BkC
    L3 --> BkA
    L3 --> BkB
    L3 --> BkC
    OffStore["Tiered storage S3"]
    L1 -. offload .-> OffStore
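
Offloading closed ledgers to tiered storage can be driven by a namespace threshold or triggered manually. A hedged Java admin sketch (the 1 GiB threshold and topic name are illustrative; a tiered-storage offloader must already be configured on the brokers):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.MessageId;

public class OffloadSketch {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://my-tenant/ns-prod/orders";      // example topic
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")              // assumed admin endpoint
                .build()) {
            // Automatic: offload namespace data once it exceeds ~1 GiB on the bookies.
            admin.namespaces().setOffloadThreshold("my-tenant/ns-prod", 1024L * 1024 * 1024);

            // Manual: offload everything up to the latest message on this topic.
            admin.topics().triggerOffload(topic, MessageId.latest);
        }
    }
}
```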

Write quorum / Ack quorum / Ensemble size

BookKeeper writes use three numbers (Eq, Wq, Aq):

| Param | Meaning |
| --- | --- |
| Ensemble size (Eq) | Number of bookies the ledger's entries are striped across. |
| Write quorum (Wq) | Number of bookies each entry is written to. |
| Ack quorum (Aq) | Number of bookies that must ack an entry before the write is considered durable. |

Common production values are Eq=3, Wq=3, Aq=2: every entry is written to three bookies, and the write completes once two acknowledge. Performance and durability trade-offs flow from these settings.
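
These quorums can be set per namespace. A minimal sketch with the Java admin client, assuming the ns-prod namespace from above (the trailing 0.0 is the mark-delete rate limit, left unthrottled here):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class PersistenceSketch {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // assumed admin endpoint
                .build()) {
            // ensemble = 3, write quorum = 3, ack quorum = 2, no mark-delete rate limit
            admin.namespaces().setPersistence("my-tenant/ns-prod",
                    new PersistencePolicies(3, 3, 2, 0.0));
        }
    }
}
```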

Subscription Types

flowchart LR
    Topic["Topic"]
    Sub1["Subscription S1\n(Exclusive)"]
    Sub2["Subscription S2\n(Failover)"]
    Sub3["Subscription S3\n(Shared)"]
    Sub4["Subscription S4\n(Key_Shared)"]
    Topic --> Sub1
    Topic --> Sub2
    Topic --> Sub3
    Topic --> Sub4
    C1["Consumer A"]
    C2["Consumer B"]
    C3["Consumer C"]
    Sub1 --> C1
    Sub2 --> C1
    Sub2 -. standby .- C2
    Sub3 --> C1
    Sub3 --> C2
    Sub3 --> C3
    Sub4 -- "key=k1" --> C1
    Sub4 -- "key=k2" --> C2
    Sub4 -- "key=k3" --> C3

| Type | Behavior |
| --- | --- |
| Exclusive | Only one consumer at a time; a second consumer is rejected. |
| Failover | One active, others standby. New active picked on failure. |
| Shared (round-robin) | Messages distributed across all consumers. |
| Key_Shared | Messages with the same key always route to the same consumer (sticky hash, auto-split). |
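
The subscription type is chosen by the consumer when it subscribes. A minimal Java sketch for a Key_Shared subscription (topic, subscription name, and broker URL are placeholders):

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedConsumerSketch {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")             // assumed broker URL
                .build()) {
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://my-tenant/ns-prod/orders")
                    .subscriptionName("S4")
                    .subscriptionType(SubscriptionType.Key_Shared) // same key -> same consumer
                    .subscribe();

            Message<byte[]> msg = consumer.receive();
            consumer.acknowledge(msg);                             // per-message ack
        }
    }
}
```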

Geo-Replication

flowchart LR
    subgraph US["us-east cluster"]
        BrokerUS["broker"]
        BookieUS["bookie"]
        ZkUS["ZK"]
    end
    subgraph EU["eu-west cluster"]
        BrokerEU["broker"]
        BookieEU["bookie"]
        ZkEU["ZK"]
    end
    subgraph APAC["ap-southeast cluster"]
        BrokerAP["broker"]
        BookieAP["bookie"]
        ZkAP["ZK"]
    end
    Global["Configuration Store\n(global ZK)"]
    ZkUS -.-> Global
    ZkEU -.-> Global
    ZkAP -.-> Global
    BrokerUS <-->|geo-replicate| BrokerEU
    BrokerEU <-->|geo-replicate| BrokerAP
    BrokerUS <-->|geo-replicate| BrokerAP

A namespace can be configured with a list of clusters; for each remote cluster, the broker that owns a topic runs an internal replicator (a dedicated cursor plus a producer) that re-publishes locally published messages to the same topic in the remote cluster. Each cluster keeps its own copy of the data.

Mechanism summary:

  • Per-namespace replication clusters set via pulsar-admin namespaces set-clusters.
  • Replicated cursor tracks per-cluster delivery position.
  • Conflict resolution: last-write-wins on identical message ids; in practice, design topics to be partitioned per-source-cluster.
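
The CLI command above has a Java admin equivalent. A hedged sketch (the cluster names must already exist in the configuration store):

```java
import java.util.Set;
import org.apache.pulsar.client.admin.PulsarAdmin;

public class GeoReplicationSketch {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // assumed admin endpoint
                .build()) {
            // Replicate every topic in the namespace to these clusters.
            admin.namespaces().setNamespaceReplicationClusters(
                    "my-tenant/ns-prod", Set.of("us-east", "eu-west", "ap-southeast"));
        }
    }
}
```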

Pulsar Functions

flowchart LR
    Input["Input topic A"]
    Func["PulsarFunction\n(user code)"]
    Output["Output topic B"]
    DLQ["Dead-letter topic"]
    Input --> Func
    Func --> Output
    Func -. on error .-> DLQ

Functions can run inside the broker (lightweight) or as a separate Functions Worker cluster (production). Source/sink IO connectors live in the same runtime.
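
A minimal Java function sketch using the Functions SDK interface; input/output topics and the dead-letter policy are supplied at deploy time (e.g. via pulsar-admin functions create), not in the code:

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Uppercases each input message; the returned value is written to the configured
// output topic. A thrown exception can be routed to the dead-letter topic,
// depending on the function's configured processing guarantees.
public class UppercaseFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        context.getLogger().info("processing from {}", context.getInputTopics());
        return input.toUpperCase();
    }
}
```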

Schemas & Transactions

  • Schema registry: built into the brokers, with schema data stored in BookKeeper. Schemas attach to individual topics; compatibility policies are set at the namespace level. Supports Avro, JSON Schema, Protobuf, and key-value composites.
  • Transactions: Pulsar transactions span produces and acks across multiple topics and partitions. The transaction coordinator runs inside the brokers and persists its state to system topics (see the sketch below).
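
A hedged Java sketch combining both: a JSON-schema consumer/producer pair and a transaction spanning a produce and an ack. Assumes transactions are enabled on the brokers (transactionCoordinatorEnabled=true) and on the client; topic and subscription names are placeholders:

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.transaction.Transaction;

public class SchemaTxnSketch {
    public static class Order { public String id; public double amount; }  // example schema type

    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed broker URL
                .enableTransaction(true)                 // required for transactions
                .build()) {
            Consumer<Order> consumer = client.newConsumer(Schema.JSON(Order.class))
                    .topic("persistent://my-tenant/ns-prod/orders")
                    .subscriptionName("txn-sub")
                    .subscribe();
            Producer<Order> producer = client.newProducer(Schema.JSON(Order.class))
                    .topic("persistent://my-tenant/ns-prod/audit-log")
                    .sendTimeout(0, TimeUnit.SECONDS)    // transactional producers disable send timeout
                    .create();

            Transaction txn = client.newTransaction()
                    .withTransactionTimeout(1, TimeUnit.MINUTES)
                    .build().get();

            Message<Order> in = consumer.receive();
            producer.newMessage(txn).value(in.getValue()).send();    // produce within the txn
            consumer.acknowledgeAsync(in.getMessageId(), txn).get(); // ack within the same txn
            txn.commit().get();                                      // both become visible atomically
        }
    }
}
```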

Performance Characteristics

| Workload | Notes |
| --- | --- |
| Single broker, R3 ledger, NVMe bookies | ~150–300 MB/s sustained (varies wildly with hardware) |
| Cluster-wide aggregate | Linear with broker + bookie counts |
| Tiered-storage cold read | Adds object-store latency to first byte |
| Geo-replication lag | Typically WAN RTT + replicator commit interval |
| Pulsar Functions overhead | ~1–10 ms/event for moderate transforms |

OpenMessaging Benchmark publishes apples-to-apples Kafka/Pulsar/RabbitMQ comparisons. Trust on-hardware measurement above any vendor blog.

Comparison Hooks

  • vs Kafka — Pulsar's compute/storage split is its big differentiator; Kafka's single-binary model is simpler ops.
  • vs Redpanda — Redpanda is Kafka-API and single-binary; Pulsar is multi-tenant and multi-protocol but heavier.
  • vs NATS — both have multi-tenant designs; NATS uses accounts, Pulsar uses tenants/namespaces. Pulsar wins on durable-stream scale, NATS on latency.