Operations¶
Production deployment of Apache Pulsar — sizing brokers and bookies separately, geo-replication setup, tiered storage, troubleshooting, and a Commands & Recipes section using pulsar-admin and pulsar-client.
Deployment Patterns¶
Standalone (dev only)¶
docker run -it -p 6650:6650 -p 8080:8080 \
--name pulsar-standalone \
apachepulsar/pulsar:latest \
bin/pulsar standalone
Production cluster¶
Three-tier: 3 brokers, 4–6 bookies, 3 ZK or etcd nodes.
| Tier | Sizing |
|---|---|
| Brokers | 3 nodes × 4 vCPU + 8 GB heap |
| Bookies | 4–6 nodes × 4–8 vCPU + 16 GB; NVMe journal + bulk SSD ledger disks |
| ZK / etcd | 3 nodes × 2 vCPU + 4 GB; small SSD |
| Configuration store | 3 nodes (often shared with local ZK in single-cluster deployments) |
| Pulsar Functions Worker | 2–3 nodes if you run Functions in non-broker mode |
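Bookie disk layout is the part of this table most worth pinning down in configuration. A minimal `bookkeeper.conf` excerpt keeping the journal on its own NVMe device; the mount paths are illustrative, not prescriptive:

```properties
# Journal on a dedicated NVMe device; ledgers striped across bulk SSDs.
journalDirectories=/mnt/nvme0/bk-journal
ledgerDirectories=/mnt/ssd0/bk-ledgers,/mnt/ssd1/bk-ledgers
# Fsync journal writes before acking -- keep enabled in production.
journalSyncData=true
```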
Kubernetes (Helm chart)¶
The Apache Helm chart (`pulsar-helm-chart`) and StreamNative's `pulsar-operator` are the two main paths.
helm repo add apache https://pulsar.apache.org/charts
helm install pulsar apache/pulsar \
--namespace pulsar --create-namespace \
--values prod-values.yaml
Key values.yaml choices: separate StatefulSets for brokers vs bookies, distinct PVC classes (NVMe for bookies, SSD for ZK).
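A sketch of the `prod-values.yaml` referenced above. The key layout follows the Apache pulsar-helm-chart conventions, but verify the exact names against the chart version you deploy:

```yaml
# prod-values.yaml sketch -- replica counts match the sizing table above;
# storage class names ("nvme", "ssd") are assumed cluster-specific classes.
broker:
  replicaCount: 3
bookkeeper:
  replicaCount: 4
  volumes:
    journal:
      storageClassName: nvme
      size: 100Gi
    ledgers:
      storageClassName: ssd
      size: 1Ti
zookeeper:
  replicaCount: 3
```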
Sizing Guidance¶
| Resource | Guidance |
|---|---|
| Broker JVM heap | 4–8 GB; leave OS page cache for ManagedLedger working set |
| Broker direct memory | Set `-XX:MaxDirectMemorySize` (via `PULSAR_MEM` in `conf/pulsar_env.sh`) to ≥ 2× heap |
| Bookie journal disk | NVMe; size to 30 minutes of ingest at full throughput |
| Bookie ledger disk | bulk SSD; size to working set + retention |
| Bookie GC | G1 for the JVM; schedule ledger garbage collection via `gcWaitTime` and the major/minor compaction intervals |
| ZK heap | 2–4 GB; ZK data <1 GB typically |
| Network | 10 GbE between brokers and bookies |
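The "30 minutes of ingest" journal rule is simple arithmetic; a throwaway sketch (the 200 MB/s figure is an assumed peak per bookie, not a measurement):

```shell
# Assumed peak ingest per bookie (MB/s) -- replace with your own measurement.
ingest_mb_per_sec=200
minutes=30
journal_gb=$(( ingest_mb_per_sec * 60 * minutes / 1024 ))
echo "journal disk: >= ${journal_gb} GB"   # prints: journal disk: >= 351 GB
```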
Best Practices¶
- Place brokers and bookies on different hosts; otherwise they contend for memory and CPU.
- Use `Wq=3, Aq=2` for general workloads; raise to `Aq=3` for stricter durability.
- Enable journal sync on bookies (`journalSyncData=true`); never disable it in production.
- Set namespace-level retention before topics are created; changing it later requires care.
- Enforce schema validation at the namespace level: `is_allow_auto_update_schema=false` for production.
- Use partitioned topics when single-topic throughput exceeds what one broker can serve.
- Tier old segments to S3 rather than over-provisioning bookie storage.
- Don't run Pulsar Functions in broker mode at scale; use a dedicated Functions Worker cluster.
- Pin clients to an API version; some Pulsar admin endpoints have evolved across releases.
- Back up ZK metadata regularly; a corrupted ZK will require a restore.
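The quorum advice above carries an invariant worth checking before you apply it: ensemble ≥ write quorum ≥ ack quorum. A small sketch that validates example numbers and shows the `pulsar-admin namespaces set-persistence` subcommand (real flags; the values are example choices, not defaults), commented out so the script runs standalone:

```shell
# Example quorum choices -- adjust per workload.
E=3; WQ=3; AQ=2
if [ "$E" -ge "$WQ" ] && [ "$WQ" -ge "$AQ" ]; then
  echo "ok: E >= Wq >= Aq"
  # pulsar-admin namespaces set-persistence my-tenant/ns-prod \
  #   --bookkeeper-ensemble "$E" \
  #   --bookkeeper-write-quorum "$WQ" \
  #   --bookkeeper-ack-quorum "$AQ"
else
  echo "invalid: need E >= Wq >= Aq" >&2
  exit 1
fi
```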
Performance Tuning¶
| Tunable | Effect |
|---|---|
| `managedLedgerCacheSizeMB` | Broker-side message cache; raise for replay-heavy workloads. |
| `dispatcherMaxRoundRobinBatchSize` | Shared-subscription dispatch batch size. |
| `managedLedgerDefaultWriteQuorum` / `managedLedgerDefaultAckQuorum` | Per-cluster durability vs. latency trade-off. |
| `journalMaxGroupWaitMSec` | Bookie journal group-commit latency (default 1 ms). |
| `brokerDeleteInactiveTopicsEnabled` | Auto-delete unused topics; disable for stable apps. |
| `loadBalancerLoadSheddingStrategy` | Algorithm for redistributing topic bundles under load. |
| `tlsEnabled` / `tlsRequireTrustedClientCertOnConnect` | Enforces mTLS when both are set. |
| `managedLedgerOffloadAutoTriggerSizeThresholdBytes` | When to auto-offload ledgers to tiered storage. |
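Several of these are plain `broker.conf` entries. An illustrative excerpt; the values are starting points, not universal recommendations:

```properties
# Larger cache helps consumers replaying recent history.
managedLedgerCacheSizeMB=1024
# Keep topics around for stable, long-lived applications.
brokerDeleteInactiveTopicsEnabled=false
# Threshold-based shedding rebalances bundles off hot brokers.
loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.ThresholdShedder
```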
Troubleshooting¶
Slow consumer / dispatcher backlog¶
Symptom: `pulsar-admin topics stats <topic>` shows `msgRateOut` < `msgRateIn` and a growing `msgBacklog`.
Causes: undersized consumer count for Shared subs, slow downstream processing.
Fix: scale consumers, raise receiverQueueSize, or split topic by adding partitions.
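Whether scaling consumers will be enough falls out of the rates in `topics stats`: drain time ≈ backlog ÷ (out-rate − in-rate). A back-of-envelope sketch with assumed figures:

```shell
# Assumed figures read from `topics stats`: msgBacklog, msgRateIn, msgRateOut.
backlog=1200000
rate_in=1000
rate_out=3000
drain_min=$(( backlog / (rate_out - rate_in) / 60 ))
echo "backlog drains in ~${drain_min} min"   # prints: backlog drains in ~10 min
```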
Bookie auto-recovery stuck¶
Symptom: bookkeeper shell autorecovery shows pending replications.
Cause: auditor can't elect a leader, or insufficient bookies for under-replicated ledgers.
Fix:
bookkeeper shell autorecovery -enable
bookkeeper shell listunderreplicated
bookkeeper shell decommissionbookie # if a bookie is permanently lost
ZK quorum loss¶
Symptom: brokers logging KeeperException; topic ownership churns.
Fix: restore ZK quorum first; brokers will reconcile. Don't restart brokers en masse — they re-acquire bundles via ZK.
Cursor lag (Shared subscription)¶
Symptom: consumers occasionally re-receive messages.
Cause: cursor checkpoint interval; default ack delay.
Fix: examine the per-subscription section of `pulsar-admin topics stats`; raise the cursor mark-delete rate limit (`managedLedgerDefaultMarkDeleteRateLimit` in broker.conf) if checkpoints lag behind acks.
Geo-replication lag¶
pulsar-admin topics stats persistent://my-tenant/ns-prod/orders
# look for replication.*.replicationBacklog
Lag often correlates with WAN RTT plus remote-cluster broker load. Monitor the `pulsar_replication_backlog` metric.
Pulsar Functions failing¶
pulsar-admin functions status \
--tenant my-tenant --namespace ns-prod --name enrichment
pulsar-admin functions stats \
--tenant my-tenant --namespace ns-prod --name enrichment
Cost Analysis¶
| Cost | Driver |
|---|---|
| Compute | Brokers + bookies + ZK; brokers can be small, bookies should be beefy. |
| Storage | Bookie disks for hot tier, S3 for cold tier. |
| Network egress | Geo-replication and tiered storage uploads. |
| Operations | Pulsar's three-tier model needs more on-call expertise than Kafka or NATS. |
| Managed offerings | StreamNative Cloud / DataStax Astra Streaming reduce ops cost. |
Commands & Recipes¶
Cluster bootstrap¶
# Initialize cluster metadata
bin/pulsar initialize-cluster-metadata \
--cluster pulsar-cluster-1 \
--metadata-store zk:zk1:2181,zk2:2181,zk3:2181 \
--configuration-metadata-store zk:zk1:2181,zk2:2181,zk3:2181 \
--web-service-url http://broker1:8080 \
--broker-service-url pulsar://broker1:6650
# Start a bookie
bin/pulsar bookie
# Start a broker
bin/pulsar broker
Tenant + namespace management¶
pulsar-admin tenants create my-tenant
pulsar-admin namespaces create my-tenant/ns-prod \
--bundles 16 \
--clusters pulsar-cluster-1
# Set retention (size, time)
pulsar-admin namespaces set-retention my-tenant/ns-prod \
--size 100G --time 720m
# Limits
pulsar-admin namespaces set-backlog-quota my-tenant/ns-prod \
--limit 50G --policy producer_request_hold
# Schema validation
pulsar-admin namespaces set-schema-validation-enforce \
--enable my-tenant/ns-prod
Topic management¶
pulsar-admin topics create-partitioned-topic \
persistent://my-tenant/ns-prod/orders --partitions 12
pulsar-admin topics list my-tenant/ns-prod
pulsar-admin topics stats persistent://my-tenant/ns-prod/orders
pulsar-admin topics get-internal-stats persistent://my-tenant/ns-prod/orders
Producing / consuming¶
pulsar-client produce persistent://my-tenant/ns-prod/orders \
--num-produce 1000 --messages "hello"
pulsar-client consume persistent://my-tenant/ns-prod/orders \
--subscription-name orders-sub \
--subscription-type Shared
Geo-replication¶
pulsar-admin namespaces set-clusters my-tenant/ns-prod \
--clusters us-east,eu-west,ap-southeast
pulsar-admin namespaces set-replicator-dispatch-rate my-tenant/ns-prod \
--msg-dispatch-rate 10000 --byte-dispatch-rate 10485760 --period 1
Tiered storage¶
pulsar-admin namespaces set-offload-policies my-tenant/ns-prod \
--driver aws-s3 \
--bucket pulsar-cold \
--region us-east-1 \
--offloadAfterThreshold 10GB \
--offloadAfterElapsed 24h
Functions¶
pulsar-admin functions create \
--tenant my-tenant --namespace ns-prod --name enrichment \
--inputs persistent://my-tenant/ns-prod/orders \
--output persistent://my-tenant/ns-prod/orders-enriched \
--jar ./enrichment-1.0.jar \
--classname com.example.Enrich \
--parallelism 3
BookKeeper diagnostics¶
# List writable and read-only bookies
bookkeeper shell listbookies -rw
bookkeeper shell listbookies -ro
# End-to-end write/read check against the local bookie
bookkeeper shell bookiesanity
# Ledgers missing replicas
bookkeeper shell listunderreplicated
Prometheus¶
Brokers, bookies, and Functions all expose Prometheus on /metrics. Use the official Grafana dashboards or StreamNative's hosted versions.
Cross-references¶
- messaging/pulsar/architecture — for the broker/bookie/metadata model you operate.
- messaging/pulsar/security — for TLS/JWT/Athenz hardening.
- messaging/index — for cross-broker comparison.