Operations

Production deployment of Apache Pulsar — sizing brokers and bookies separately, geo-replication setup, tiered storage, troubleshooting, and a Commands & Recipes section using pulsar-admin and pulsar-client.

Deployment Patterns

Standalone (dev only)

docker run -it -p 6650:6650 -p 8080:8080 \
  --name pulsar-standalone \
  apachepulsar/pulsar:latest \
  bin/pulsar standalone

Production cluster

Three-tier: 3 brokers, 4–6 bookies, 3 ZK or etcd nodes.

Tier | Sizing
Brokers | 3 nodes × 4 vCPU + 8 GB heap
Bookies | 4–6 nodes × 4–8 vCPU + 16 GB; NVMe journal + bulk SSD ledger disks
ZK / etcd | 3 nodes × 2 vCPU + 4 GB; small SSD
Configuration store | 3 nodes (often shared with local ZK in single-cluster deployments)
Pulsar Functions Worker | 2–3 nodes if you run Functions in non-broker mode

Kubernetes (Helm chart)

The Apache Helm chart pulsar-helm-chart and StreamNative's pulsar-operator are the two main paths.

helm repo add apache https://pulsar.apache.org/charts
helm install pulsar apache/pulsar \
  --namespace pulsar --create-namespace \
  --values prod-values.yaml

Key values.yaml choices: separate StatefulSets for brokers vs bookies, distinct PVC classes (NVMe for bookies, SSD for ZK).

Sizing Guidance

Resource | Guidance
Broker JVM heap | 4–8 GB; the ManagedLedger cache lives in direct memory, so a larger heap does not help it
Broker direct memory | -XX:MaxDirectMemorySize (set through PULSAR_MEM) ≥ 2× heap
Bookie journal disk | NVMe; size to 30 minutes of ingest at full throughput
Bookie ledger disk | bulk SSD; size to working set + retention
Bookie GC | Java G1; tune GC threads with JVM flags such as -XX:ParallelGCThreads and -XX:ConcGCThreads
ZK heap | 2–4 GB; ZK data is typically <1 GB
Network | 10 GbE between brokers and bookies
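
The journal and ledger disk rules above reduce to simple arithmetic. The sketch below is one way to work it out, assuming steady ingest, even striping across bookies, and no compression; the function names and the example numbers are illustrative, not taken from any Pulsar tool.

```python
def journal_bytes_per_bookie(ingest_mb_per_s: float, write_quorum: int,
                             bookies: int, window_minutes: float = 30.0) -> float:
    """Journal space one bookie needs to hold `window_minutes` of ingest."""
    # Every entry is written to `write_quorum` bookies, so cluster-wide journal
    # traffic is ingest * write_quorum, spread (by assumption) evenly.
    per_bookie_mb_per_s = ingest_mb_per_s * write_quorum / bookies
    return per_bookie_mb_per_s * window_minutes * 60 * 1024 * 1024


def ledger_bytes_per_bookie(ingest_mb_per_s: float, write_quorum: int,
                            bookies: int, retention_hours: float) -> float:
    """Ledger space one bookie needs to hold `retention_hours` of data."""
    per_bookie_mb_per_s = ingest_mb_per_s * write_quorum / bookies
    return per_bookie_mb_per_s * retention_hours * 3600 * 1024 * 1024


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical cluster: 200 MB/s ingest, Wq=3, 6 bookies.
    print(round(journal_bytes_per_bookie(200, 3, 6) / gib, 1), "GiB journal")
    print(round(ledger_bytes_per_bookie(200, 3, 6, 24) / gib, 1), "GiB ledger per day")
```

With those example inputs each bookie absorbs 100 MB/s, so 30 minutes of journal comes to roughly 176 GiB. Add headroom on top; the formula gives a floor, not a purchase order.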

Best Practices

  • Place brokers and bookies on different hosts. They contend for memory and CPU otherwise.
  • Use write quorum 3 and ack quorum 2 (Wq=3, Aq=2) for general workloads; raise to Aq=3 for stricter durability at the cost of higher write latency.
  • Enable journal sync on bookies (journalSyncData=true); never disable for prod.
  • Set namespace-level retention before topics are created; changing later requires care.
  • Disable schema auto-update at the namespace level in prod (pulsar-admin namespaces set-is-allow-auto-update-schema --disable <tenant>/<namespace>).
  • Use partitioned topics when single-topic throughput exceeds a single broker.
  • Tier old segments to S3 rather than over-provisioning bookie storage.
  • Don't run Pulsar Functions in broker mode at scale; use a dedicated Functions Worker cluster.
  • Pin clients to a specific API version; some Pulsar admin endpoints have changed between releases.
  • Back up ZK metadata regularly; a corrupted ZK requires a restore from backup.
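
The Wq/Aq recommendation is easier to reason about with the trade-off written out. This is a minimal sketch of the standard BookKeeper quorum arithmetic: an acknowledged entry exists on at least Aq bookies, and a write can complete while up to Wq − Aq bookies in the write set are slow. The `Quorum` class is a hypothetical helper, not part of any Pulsar or BookKeeper API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quorum:
    ensemble: int      # E: bookies a ledger is striped across
    write_quorum: int  # Wq: bookies each entry is sent to
    ack_quorum: int    # Aq: acks required before the write is confirmed

    def __post_init__(self) -> None:
        if not (self.ensemble >= self.write_quorum >= self.ack_quorum >= 1):
            raise ValueError("need ensemble >= write_quorum >= ack_quorum >= 1")

    def losses_survived(self) -> int:
        """Bookie losses an entry survives at the moment it is acknowledged."""
        return self.ack_quorum - 1

    def slow_bookies_tolerated(self) -> int:
        """Bookies in the write set that can lag without delaying the ack."""
        return self.write_quorum - self.ack_quorum


if __name__ == "__main__":
    for q in (Quorum(3, 3, 2), Quorum(3, 3, 3)):
        print(q, q.losses_survived(), q.slow_bookies_tolerated())
```

Under these definitions Wq=3, Aq=2 survives one bookie loss at ack time and hides one slow bookie; Aq=3 survives two losses but every write waits for the slowest of the three.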

Performance Tuning

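Tunable | Effect
managedLedgerCacheSizeMB | Broker-side message cache; raise for replay-heavy workloads.
dispatcherMaxRoundRobinBatchSize | Maximum entries dispatched per round to consumers of a Shared subscription.
managedLedgerDefaultWriteQuorum / managedLedgerDefaultAckQuorum | Cluster-wide default durability vs latency trade-off; override per namespace with pulsar-admin namespaces set-persistence.
journalMaxGroupWaitMSec | Maximum time the bookie journal waits to group writes before flushing (1 ms in Pulsar's bookkeeper.conf).
brokerDeleteInactiveTopicsEnabled | Auto-delete unused topics; turn off for stable apps.
loadBalancerLoadSheddingStrategy | Algorithm for redistributing topics under load.
brokerServicePortTls / tlsRequireTrustedClientCertOnConnect | The first opens the TLS listener; the second rejects clients without a trusted certificate (mTLS).
managedLedgerOffloadAutoTriggerSizeThresholdBytes / managedLedgerOffloadThresholdInSeconds | Size and age thresholds that trigger offload to tiered storage (the age threshold exists only in newer releases).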

Troubleshooting

Slow consumer / dispatcher backlog

Symptom: pulsar-admin topics stats <topic> shows msgRateOut < msgRateIn and growing msgBacklog.

Causes: undersized consumer count for Shared subs, slow downstream processing.

Fix: scale consumers, raise receiverQueueSize, or split topic by adding partitions.
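
A single stats sample cannot show that a backlog is growing, so the check needs two snapshots. The sketch below assumes each argument is the parsed JSON from pulsar-admin topics stats, taken some interval apart; `growing_backlogs` is a hypothetical helper name, while the field names (msgRateIn, msgRateOut, msgBacklog, subscriptions) are the ones the stats output uses.

```python
from typing import Dict


def growing_backlogs(before: dict, after: dict) -> Dict[str, int]:
    """Return {subscription: backlog growth} for subscriptions falling behind.

    A subscription is reported when its msgBacklog grew between the two
    snapshots and its msgRateOut is below the topic's msgRateIn.
    """
    rate_in = after.get("msgRateIn", 0.0)
    previous = before.get("subscriptions", {})
    growth = {}
    for name, sub in after.get("subscriptions", {}).items():
        delta = sub.get("msgBacklog", 0) - previous.get(name, {}).get("msgBacklog", 0)
        if delta > 0 and sub.get("msgRateOut", 0.0) < rate_in:
            growth[name] = delta
    return growth


if __name__ == "__main__":
    # Made-up snapshots for illustration only.
    t0 = {"msgRateIn": 500.0,
          "subscriptions": {"orders-sub": {"msgBacklog": 1000, "msgRateOut": 300.0}}}
    t1 = {"msgRateIn": 500.0,
          "subscriptions": {"orders-sub": {"msgBacklog": 13000, "msgRateOut": 300.0}}}
    print(growing_backlogs(t0, t1))
```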

Bookie auto-recovery stuck

Symptom: bookkeeper shell listunderreplicated keeps reporting the same under-replicated ledgers.

Cause: auditor can't elect a leader, or insufficient bookies for under-replicated ledgers.

Fix:

bookkeeper shell autorecovery -enable
bookkeeper shell listunderreplicated
bookkeeper shell decommissionbookie  # if a bookie is permanently lost

ZK quorum loss

Symptom: brokers logging KeeperException; topic ownership churns.

Fix: restore ZK quorum first; brokers will reconcile. Don't restart brokers en masse — they re-acquire bundles via ZK.

Cursor lag (Shared subscription)

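Symptom: consumers occasionally re-receive messages they have already acknowledged.

Cause: acknowledgements are batched on the client (acknowledgmentGroupTime in the Java client, 100 ms by default), and the broker persists the cursor's mark-delete position at a limited rate (managedLedgerDefaultMarkDeleteRateLimit). Acks not yet persisted when a broker restarts or a topic is unloaded are lost, and those messages are redelivered.

Fix: compare markDeletePosition and readPosition for the subscription's cursor in pulsar-admin topics stats-internal <topic>; raise the mark-delete rate limit if the gap is consistently large.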

Geo-replication lag

pulsar-admin topics partitioned-stats persistent://my-tenant/ns-prod/orders
# look for replication.*.replicationBacklog

Lag usually tracks WAN round-trip time and load on the remote cluster's brokers. Monitor the pulsar_replication_backlog metric.
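
To turn a replication backlog into a rough time estimate, divide it by the spare replication capacity. The sketch below assumes the stats JSON has a replication map keyed by remote cluster with a replicationBacklog field, as in the output referenced above; the function names and the rates in the example are hypothetical.

```python
import math
from typing import Dict


def replication_backlogs(stats: dict) -> Dict[str, int]:
    """Map each remote cluster to its replicationBacklog (in messages)."""
    return {cluster: int(rep.get("replicationBacklog", 0))
            for cluster, rep in stats.get("replication", {}).items()}


def drain_seconds(backlog_msgs: int, dispatch_rate: float, publish_rate: float) -> float:
    """Estimate seconds to drain a backlog, given rates in messages per second.

    Returns infinity when the replicator cannot outpace new publishes.
    """
    if backlog_msgs <= 0:
        return 0.0
    spare = dispatch_rate - publish_rate
    if spare <= 0:
        return math.inf
    return backlog_msgs / spare


if __name__ == "__main__":
    sample = {"replication": {"eu-west": {"replicationBacklog": 90000}}}
    for cluster, backlog in replication_backlogs(sample).items():
        # Assumed: replicator capped at 10000 msg/s, producers at 4000 msg/s.
        print(cluster, drain_seconds(backlog, 10000, 4000), "s")
```

An infinite result means the replicator dispatch rate is set too low for the publish rate, so the lag will never clear on its own.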

Pulsar Functions failing

pulsar-admin functions status \
  --tenant my-tenant --namespace ns-prod --name enrichment
pulsar-admin functions stats ...

Cost Analysis

Cost | Driver
Compute | Brokers + bookies + ZK; brokers can be small, bookies should be beefy.
Storage | Bookie disks for hot tier, S3 for cold tier.
Network egress | Geo-replication and tiered storage uploads.
Operations | Pulsar's three-tier model needs more on-call expertise than Kafka or NATS.
Managed offerings | StreamNative Cloud / DataStax Astra Streaming reduce ops cost.

Commands & Recipes

Cluster bootstrap

# Initialize cluster metadata
bin/pulsar initialize-cluster-metadata \
  --cluster pulsar-cluster-1 \
  --metadata-store zk:zk1:2181,zk2:2181,zk3:2181 \
  --configuration-metadata-store zk:zk1:2181,zk2:2181,zk3:2181 \
  --web-service-url http://broker1:8080 \
  --broker-service-url pulsar://broker1:6650

# Start a bookie
bin/pulsar bookie

# Start a broker
bin/pulsar broker

Tenant + namespace management

pulsar-admin tenants create my-tenant
pulsar-admin namespaces create my-tenant/ns-prod \
  --bundles 16 \
  --clusters pulsar-cluster-1

# Set retention (size, time); 720m means 720 minutes, i.e. 12 hours
pulsar-admin namespaces set-retention my-tenant/ns-prod \
  --size 100G --time 720m

# Limits
pulsar-admin namespaces set-backlog-quota my-tenant/ns-prod \
  --limit 50G --policy producer_request_hold

# Schema validation
pulsar-admin namespaces set-schema-validation-enforce \
  --enable my-tenant/ns-prod

Topic management

pulsar-admin topics create-partitioned-topic \
  persistent://my-tenant/ns-prod/orders --partitions 12
pulsar-admin topics list my-tenant/ns-prod
# orders is partitioned, so use the partitioned variants for aggregate stats
pulsar-admin topics partitioned-stats persistent://my-tenant/ns-prod/orders
pulsar-admin topics partitioned-stats-internal persistent://my-tenant/ns-prod/orders
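
Choosing the partition count is mostly division. A rough sketch, assuming you have measured what one partition sustains on your hardware; `partitions_needed`, the headroom fraction, and the example throughput figures are all illustrative assumptions rather than Pulsar defaults.

```python
import math


def partitions_needed(target_mb_per_s: float, per_partition_mb_per_s: float,
                      headroom: float = 0.3, brokers: int = 3) -> int:
    """Partitions required for a target throughput.

    `headroom` is the fraction of per-partition capacity held in reserve.
    The result is rounded up to a multiple of `brokers` so that partitions
    can spread evenly across them.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = per_partition_mb_per_s * (1 - headroom)
    raw = math.ceil(target_mb_per_s / usable)
    return math.ceil(raw / brokers) * brokers


if __name__ == "__main__":
    # Hypothetical: 300 MB/s target, 40 MB/s measured per partition.
    print(partitions_needed(300, 40))
```

With the example inputs, 28 MB/s of usable capacity per partition gives 11 partitions, rounded up to 12 across 3 brokers.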

Producing / consuming

pulsar-client produce persistent://my-tenant/ns-prod/orders \
  --num-produce 1000 --messages "hello"

pulsar-client consume persistent://my-tenant/ns-prod/orders \
  --subscription-name orders-sub \
  --subscription-type Shared \
  --num-messages 0  # 0 keeps consuming; the default stops after one message

Geo-replication

pulsar-admin namespaces set-clusters my-tenant/ns-prod \
  --clusters us-east,eu-west,ap-southeast
pulsar-admin namespaces set-replicator-dispatch-rate my-tenant/ns-prod \
  --msg-dispatch-rate 10000 --byte-dispatch-rate 10485760 --dispatch-rate-period 1

Tiered storage

pulsar-admin namespaces set-offload-policies my-tenant/ns-prod \
  --driver aws-s3 \
  --bucket pulsar-cold \
  --region us-east-1 \
  --offloadAfterThreshold 10G \
  --offloadAfterElapsed 24h
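
To see what an offload policy does to the storage bill, estimate how much data sits on each tier. The sketch below is a deliberate simplification: it assumes steady ingest, offload triggered purely by age, offloaded ledgers removed from bookies immediately (in practice there is a deletion lag), and one logical copy in object storage. `tier_split_bytes` and its parameters are hypothetical.

```python
from typing import Tuple

MIB = 1024 * 1024


def tier_split_bytes(ingest_mb_per_s: int, retention_hours: int,
                     offload_after_hours: int, write_quorum: int = 3) -> Tuple[int, int]:
    """Return (hot_bytes_on_bookies, cold_bytes_in_object_store)."""
    hot_hours = min(retention_hours, offload_after_hours)
    cold_hours = max(retention_hours - offload_after_hours, 0)
    # Bookies hold write_quorum replicas of everything still in the hot window.
    hot = ingest_mb_per_s * hot_hours * 3600 * MIB * write_quorum
    # The object store is assumed to hold a single copy of older data.
    cold = ingest_mb_per_s * cold_hours * 3600 * MIB
    return hot, cold


if __name__ == "__main__":
    tib = 1024 ** 4
    hot, cold = tier_split_bytes(10, 720, 24)
    print(round(hot / tib, 2), "TiB hot,", round(cold / tib, 2), "TiB cold")
```

For a hypothetical 10 MB/s stream kept 30 days with a 24 h offload, that works out to about 2.5 TiB on bookies and about 24 TiB in the bucket.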

Functions

pulsar-admin functions create \
  --tenant my-tenant --namespace ns-prod --name enrichment \
  --inputs persistent://my-tenant/ns-prod/orders \
  --output persistent://my-tenant/ns-prod/orders-enriched \
  --jar ./enrichment-1.0.jar \
  --classname com.example.Enrich \
  --parallelism 3

BookKeeper diagnostics

bookkeeper shell bookieinfo
bookkeeper shell bookiesanity
bookkeeper shell autorecovery -status

Prometheus

Brokers, bookies, and Functions all expose Prometheus metrics at /metrics. Use the official Grafana dashboards or StreamNative's hosted versions.
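
For a quick check without a Prometheus server, the text exposition format can be parsed directly. This is a minimal sketch that sums one metric by one label; it handles plain sample lines with an optional trailing timestamp and is not a full parser. The sample payload is made up for illustration, and `sum_metric_by_label` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict

# metric_name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)(?:\s+-?\d+)?$')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_metric_by_label(text: str, metric: str, label: str) -> Dict[str, float]:
    """Sum every sample of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    payload = '''
# TYPE pulsar_msg_backlog gauge
pulsar_msg_backlog{namespace="my-tenant/ns-prod",topic="orders-partition-0"} 120.0 1700000000000
pulsar_msg_backlog{namespace="my-tenant/ns-prod",topic="orders-partition-1"} 30.0 1700000000000
'''
    print(sum_metric_by_label(payload, "pulsar_msg_backlog", "namespace"))
```

In practice `text` would be the body of an HTTP GET against a broker's /metrics endpoint.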

Cross-references