Operations¶
Production deployment of Apache Pulsar — sizing brokers and bookies separately, geo-replication setup, tiered storage, troubleshooting, and a Commands & Recipes section using pulsar-admin and pulsar-client.
Deployment Patterns¶
Standalone (dev only)¶
docker run -it -p 6650:6650 -p 8080:8080 \
--name pulsar-standalone \
apachepulsar/pulsar:latest \
bin/pulsar standalone
Production cluster¶
Three-tier: 3 brokers, 4–6 bookies, 3 ZK or etcd nodes.
| Tier | Sizing |
|---|---|
| Brokers | 3 nodes × 4 vCPU + 8 GB heap |
| Bookies | 4–6 nodes × 4–8 vCPU + 16 GB; NVMe journal + bulk SSD ledger disks |
| ZK / etcd | 3 nodes × 2 vCPU + 4 GB; small SSD |
| Configuration store | 3 nodes (often shared with local ZK in single-cluster deployments) |
| Pulsar Functions Worker | 2–3 nodes if you run Functions in non-broker mode |
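Bookie disk layout is the part of this table most worth pinning down in configuration. A minimal `bookkeeper.conf` excerpt keeping the journal on its own NVMe device; the mount paths are illustrative, not prescriptive:

```properties
# Journal on a dedicated NVMe device; ledgers striped across bulk SSDs.
journalDirectories=/mnt/nvme0/bk-journal
ledgerDirectories=/mnt/ssd0/bk-ledgers,/mnt/ssd1/bk-ledgers
# Fsync journal writes before acking -- keep enabled in production.
journalSyncData=true
```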
Kubernetes (Helm chart)¶
The Apache Helm chart (`pulsar-helm-chart`) and StreamNative's `pulsar-operator` are the two main paths.
helm repo add apache https://pulsar.apache.org/charts
helm install pulsar apache/pulsar \
--namespace pulsar --create-namespace \
--values prod-values.yaml
Key values.yaml choices: separate StatefulSets for brokers vs bookies, distinct PVC classes (NVMe for bookies, SSD for ZK).
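A sketch of the `prod-values.yaml` referenced above. The key layout follows the Apache pulsar-helm-chart conventions, but verify the exact names against the chart version you deploy:

```yaml
# prod-values.yaml sketch -- replica counts match the sizing table above;
# storage class names ("nvme", "ssd") are assumed cluster-specific classes.
broker:
  replicaCount: 3
bookkeeper:
  replicaCount: 4
  volumes:
    journal:
      storageClassName: nvme
      size: 100Gi
    ledgers:
      storageClassName: ssd
      size: 1Ti
zookeeper:
  replicaCount: 3
```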
Sizing Guidance¶
| Resource | Guidance |
|---|---|
| Broker JVM heap | 4–8 GB; leave OS page cache for ManagedLedger working set |
| Broker direct memory | Set `-XX:MaxDirectMemorySize` (via `PULSAR_MEM` in `conf/pulsar_env.sh`) to ≥ 2× heap |
| Bookie journal disk | NVMe; size to 30 minutes of ingest at full throughput |
| Bookie ledger disk | bulk SSD; size to working set + retention |
| Bookie GC | G1 for the JVM; schedule ledger garbage collection via `gcWaitTime` and the major/minor compaction intervals |
| ZK heap | 2–4 GB; ZK data <1 GB typically |
| Network | 10 GbE between brokers and bookies |
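The "30 minutes of ingest" journal rule is simple arithmetic; a throwaway sketch (the 200 MB/s figure is an assumed peak per bookie, not a measurement):

```shell
# Assumed peak ingest per bookie (MB/s) -- replace with your own measurement.
ingest_mb_per_sec=200
minutes=30
journal_gb=$(( ingest_mb_per_sec * 60 * minutes / 1024 ))
echo "journal disk: >= ${journal_gb} GB"   # prints: journal disk: >= 351 GB
```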
Best Practices¶
- Place brokers and bookies on different hosts; otherwise they contend for memory and CPU.
- Use `Wq=3, Aq=2` for general workloads; raise to `Aq=3` for stricter durability.
- Enable journal sync on bookies (`journalSyncData=true`); never disable it in production.
- Set namespace-level retention before topics are created; changing it later requires care.
- Enforce schema validation at the namespace level: `is_allow_auto_update_schema=false` for production.
- Use partitioned topics when single-topic throughput exceeds what one broker can serve.
- Tier old segments to S3 rather than over-provisioning bookie storage.
- Don't run Pulsar Functions in broker mode at scale; use a dedicated Functions Worker cluster.
- Pin clients to an API version; some Pulsar admin endpoints have evolved across releases.
- Back up ZK metadata regularly; a corrupted ZK will require a restore.
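The quorum advice above carries an invariant worth checking before you apply it: ensemble ≥ write quorum ≥ ack quorum. A small sketch that validates example numbers and shows the `pulsar-admin namespaces set-persistence` subcommand (real flags; the values are example choices, not defaults), commented out so the script runs standalone:

```shell
# Example quorum choices -- adjust per workload.
E=3; WQ=3; AQ=2
if [ "$E" -ge "$WQ" ] && [ "$WQ" -ge "$AQ" ]; then
  echo "ok: E >= Wq >= Aq"
  # pulsar-admin namespaces set-persistence my-tenant/ns-prod \
  #   --bookkeeper-ensemble "$E" \
  #   --bookkeeper-write-quorum "$WQ" \
  #   --bookkeeper-ack-quorum "$AQ"
else
  echo "invalid: need E >= Wq >= Aq" >&2
  exit 1
fi
```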
Performance Tuning¶
| Tunable | Effect |
|---|---|
| `managedLedgerCacheSizeMB` | Broker-side message cache; raise for replay-heavy workloads. |
| `dispatcherMaxRoundRobinBatchSize` | Shared-subscription dispatch batch size. |
| `managedLedgerDefaultWriteQuorum` / `managedLedgerDefaultAckQuorum` | Per-cluster durability vs. latency trade-off. |
| `journalMaxGroupWaitMSec` | Bookie journal group-commit latency (default 1 ms). |
| `brokerDeleteInactiveTopicsEnabled` | Auto-delete unused topics; disable for stable apps. |
| `loadBalancerLoadSheddingStrategy` | Algorithm for redistributing topic bundles under load. |
| `tlsEnabled` / `tlsRequireTrustedClientCertOnConnect` | Enforces mTLS when both are set. |
| `managedLedgerOffloadAutoTriggerSizeThresholdBytes` | When to auto-offload ledgers to tiered storage. |
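Several of these are plain `broker.conf` entries. An illustrative excerpt; the values are starting points, not universal recommendations:

```properties
# Larger cache helps consumers replaying recent history.
managedLedgerCacheSizeMB=1024
# Keep topics around for stable, long-lived applications.
brokerDeleteInactiveTopicsEnabled=false
# Threshold-based shedding rebalances bundles off hot brokers.
loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.ThresholdShedder
```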
Troubleshooting¶
Slow consumer / dispatcher backlog¶
Symptom: `pulsar-admin topics stats <topic>` shows `msgRateOut` < `msgRateIn` and a growing `msgBacklog`.
Causes: undersized consumer count for Shared subs, slow downstream processing.
Fix: scale consumers, raise receiverQueueSize, or split topic by adding partitions.
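Whether scaling consumers will be enough falls out of the rates in `topics stats`: drain time ≈ backlog ÷ (out-rate − in-rate). A back-of-envelope sketch with assumed figures:

```shell
# Assumed figures read from `topics stats`: msgBacklog, msgRateIn, msgRateOut.
backlog=1200000
rate_in=1000
rate_out=3000
drain_min=$(( backlog / (rate_out - rate_in) / 60 ))
echo "backlog drains in ~${drain_min} min"   # prints: backlog drains in ~10 min
```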
Bookie auto-recovery stuck¶
Symptom: bookkeeper shell autorecovery shows pending replications.
Cause: auditor can't elect a leader, or insufficient bookies for under-replicated ledgers.
Fix:
bookkeeper shell autorecovery -enable
bookkeeper shell listunderreplicated
bookkeeper shell decommissionbookie # if a bookie is permanently lost
ZK quorum loss¶
Symptom: brokers logging KeeperException; topic ownership churns.
Fix: restore ZK quorum first; brokers will reconcile. Don't restart brokers en masse — they re-acquire bundles via ZK.
Cursor lag (Shared subscription)¶
Symptom: consumers occasionally re-receive messages.
Cause: cursor checkpoint interval; default ack delay.
Fix: examine the per-subscription section of `pulsar-admin topics stats`; raise the cursor mark-delete rate limit (`managedLedgerDefaultMarkDeleteRateLimit` in broker.conf) if checkpoints lag behind acks.
Geo-replication lag¶
pulsar-admin topics stats persistent://my-tenant/ns-prod/orders
# look for replication.*.replicationBacklog
Lag often correlates with WAN RTT plus remote-cluster broker load. Monitor the `pulsar_replication_backlog` metric.
Pulsar Functions failing¶
pulsar-admin functions status \
--tenant my-tenant --namespace ns-prod --name enrichment
pulsar-admin functions stats \
--tenant my-tenant --namespace ns-prod --name enrichment
Cost Analysis¶
| Cost | Driver |
|---|---|
| Compute | Brokers + bookies + ZK; brokers can be small, bookies should be beefy. |
| Storage | Bookie disks for hot tier, S3 for cold tier. |
| Network egress | Geo-replication and tiered storage uploads. |
| Operations | Pulsar's three-tier model needs more on-call expertise than Kafka or NATS. |
| Managed offerings | StreamNative Cloud / DataStax Astra Streaming reduce ops cost. |
Commands & Recipes¶
Cluster bootstrap¶
# Initialize cluster metadata
bin/pulsar initialize-cluster-metadata \
--cluster pulsar-cluster-1 \
--metadata-store zk:zk1:2181,zk2:2181,zk3:2181 \
--configuration-metadata-store zk:zk1:2181,zk2:2181,zk3:2181 \
--web-service-url http://broker1:8080 \
--broker-service-url pulsar://broker1:6650
# Start a bookie
bin/pulsar bookie
# Start a broker
bin/pulsar broker
Tenant + namespace management¶
pulsar-admin tenants create my-tenant
pulsar-admin namespaces create my-tenant/ns-prod \
--bundles 16 \
--clusters pulsar-cluster-1
# Set retention (size, time)
pulsar-admin namespaces set-retention my-tenant/ns-prod \
--size 100G --time 720m
# Limits
pulsar-admin namespaces set-backlog-quota my-tenant/ns-prod \
--limit 50G --policy producer_request_hold
# Schema validation
pulsar-admin namespaces set-schema-validation-enforce \
--enable my-tenant/ns-prod
Topic management¶
pulsar-admin topics create-partitioned-topic \
persistent://my-tenant/ns-prod/orders --partitions 12
pulsar-admin topics list my-tenant/ns-prod
pulsar-admin topics stats persistent://my-tenant/ns-prod/orders
pulsar-admin topics get-internal-stats persistent://my-tenant/ns-prod/orders
Producing / consuming¶
pulsar-client produce persistent://my-tenant/ns-prod/orders \
--num-produce 1000 --messages "hello"
pulsar-client consume persistent://my-tenant/ns-prod/orders \
--subscription-name orders-sub \
--subscription-type Shared
Geo-replication¶
pulsar-admin namespaces set-clusters my-tenant/ns-prod \
--clusters us-east,eu-west,ap-southeast
pulsar-admin namespaces set-replicator-dispatch-rate my-tenant/ns-prod \
--msg-dispatch-rate 10000 --byte-dispatch-rate 10485760 --period 1
Tiered storage¶
pulsar-admin namespaces set-offload-policies my-tenant/ns-prod \
--driver aws-s3 \
--bucket pulsar-cold \
--region us-east-1 \
--offloadAfterThreshold 10GB \
--offloadAfterElapsed 24h
Functions¶
pulsar-admin functions create \
--tenant my-tenant --namespace ns-prod --name enrichment \
--inputs persistent://my-tenant/ns-prod/orders \
--output persistent://my-tenant/ns-prod/orders-enriched \
--jar ./enrichment-1.0.jar \
--classname com.example.Enrich \
--parallelism 3
BookKeeper diagnostics¶
# List writable and read-only bookies
bookkeeper shell listbookies -rw
bookkeeper shell listbookies -ro
# End-to-end write/read check against the local bookie
bookkeeper shell bookiesanity
# Ledgers missing replicas
bookkeeper shell listunderreplicated
Prometheus¶
Brokers, bookies, and Functions all expose Prometheus on /metrics. Use the official Grafana dashboards or StreamNative's hosted versions.
Cross-references¶
- messaging/pulsar/architecture — for the broker/bookie/metadata model you operate.
- messaging/pulsar/security — for TLS/JWT/Athenz hardening.
- messaging/index — for cross-broker comparison.