Operations¶
Production guidance for running NATS Server and JetStream — sizing, deployment, tuning, troubleshooting, and a Commands & Recipes section with the nats, nsc, and nats-top CLIs.
Deployment Patterns¶
Single cluster (3 or 5 nodes)¶
The default footprint: an odd number of nats-servers in one cluster, full-mesh routes on port 6222, JetStream enabled, R3 streams.
```mermaid
flowchart LR
    n1["n1\nnats-server"]
    n2["n2\nnats-server"]
    n3["n3\nnats-server"]
    n1 <--> n2
    n2 <--> n3
    n1 <--> n3
    Clients["Apps / Microservices"]
    Clients --> n1
    Clients --> n2
    Clients --> n3
```
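The same topology in config-file form (the Commands & Recipes section below uses CLI flags instead); hostnames and paths here are illustrative:

```
server_name: n1
port: 4222

jetstream {
  store_dir: "/data/jetstream"
}

cluster {
  name: C1
  port: 6222
  # full-mesh routes to the other members
  routes = [ nats://n2:6222, nats://n3:6222 ]
}
```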
Supercluster (multi-region)¶
Two or more clusters connected by gateways. Useful when latency or compliance requires regional isolation but a unified subject space is wanted.
Hub + leaf nodes (edge / SaaS tenant)¶
Central hub cluster runs JetStream. Leaf nodes at the edge (factories, branches, customer sites) connect outbound on TCP 7422. Each leaf is account-scoped — the leaf operator decides what subjects to expose.
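In configuration terms this is two small fragments; a sketch, with the hub hostname and credentials path as assumptions:

```
# hub cluster: accept inbound leaf connections
leafnodes {
  port: 7422
}

# edge leaf node: dial out to the hub with account-scoped creds
leafnodes {
  remotes = [
    { url: "nats://hub.example.com:7422", credentials: "/etc/nats/leaf.creds" }
  ]
}
```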
Kubernetes (Helm chart)¶
Use the official Helm chart at github.com/nats-io/k8s. The pattern:
- StatefulSet with persistent volumes for `store_dir`.
- Headless service for cluster routing.
- HPA only for non-JetStream workloads — never autoscale a JetStream cluster member.
Sizing¶
| Resource | Guidance |
|---|---|
| CPU | 4–8 vCPUs per server is typical; Core NATS is rarely CPU-bound. |
| Memory | 8 GB baseline; raise for large interest graphs and JetStream caches. |
| Disk | NVMe for JetStream store_dir; watch fsync latency. |
| Network | 10 GbE+ for high-throughput JetStream; gateways tolerate higher latency. |
| Pods/StatefulSet | Always odd counts (3, 5) for Raft. |
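The disk row can be turned into a back-of-envelope calculation. The workload numbers below are made up; plug in your own rates:

```shell
#!/bin/sh
# Rough JetStream file-store sizing (all workload numbers are hypothetical)
msgs_per_sec=2000
avg_msg_bytes=1024
retention_hours=720          # e.g. a stream with max_age 720h
replicas=3                   # R3: every replica stores a full copy

bytes_per_replica=$(( msgs_per_sec * avg_msg_bytes * retention_hours * 3600 ))
bytes_cluster=$(( bytes_per_replica * replicas ))

echo "per-replica:  $(( bytes_per_replica / 1024 / 1024 / 1024 )) GiB"
echo "cluster-wide: $(( bytes_cluster / 1024 / 1024 / 1024 )) GiB"
```

Add headroom on top of this figure: deletes are lazy, and catch-up traffic after a peer restart temporarily increases disk usage.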
Best Practices¶
- Subject design. Use a stable token order and avoid PII in subject tokens (subjects are not encrypted in the wire protocol metadata). Prefer a hierarchical scheme: `region.tenant.entity.action.id`.
- Retention. Pick exactly one of `max_age`, `max_bytes`, `max_msgs` as the dominant constraint per stream; the others should be safety nets.
- Replicas. R1 for development and edge; R3 for production; R5 only for high-fan-in commands or critical KV.
- Consumers. Pull consumers for load-balanced backends; push consumers for low-latency fan-out.
- System account ops user. Always create a dedicated `sys.creds` for monitoring tools. Do not share it with normal apps.
- Account isolation. One account per trust boundary (tenant, environment, team). Use exports/imports for the few flows that legitimately cross boundaries.
- Backups. `nats stream backup` for individual streams; back up the JWT account tree separately via `nsc list keys`.
- Homogeneous clusters. Do not mix JetStream-enabled and Core-only nodes inconsistently in one cluster; a uniform topology keeps Raft assignment predictable.
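Account isolation with exports/imports can also be expressed directly in a memory-resolver server config; a sketch where the account names, users, and subjects are all hypothetical:

```
accounts {
  TENANT_A: {
    users: [ { user: "tenant-a-svc", password: "changeme" } ]
    # expose only this tenant's order events
    exports: [ { stream: "orders.tenant-a.>" } ]
  }
  BILLING: {
    users: [ { user: "billing-svc", password: "changeme" } ]
    # the one flow that legitimately crosses the boundary
    imports: [ { stream: { account: TENANT_A, subject: "orders.tenant-a.>" } } ]
  }
}
```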
Performance Tuning¶
| Tunable | Where | Notes |
|---|---|---|
| `max_payload` | server config | Default 1 MB. Raise carefully — affects buffer sizing on every connection. |
| `write_deadline` | server config | Default 10s; lower if you'd rather drop slow consumers fast. |
| `max_pending` (per consumer) | client | Backpressure threshold for push subscribers. |
| `prevent_gateway_dial_through_routes` | server config | Forces gateway membership only via gossip — sometimes useful for predictable topology. |
| `cluster.no_advertise` | server config | Hide internal IPs from gossip when behind a load balancer. |
| `cipher_suites` (TLS) | server config | Restrict to AEAD ciphers only. |
| `jetstream.max_memory` / `jetstream.max_file` | server config | Per-server JetStream limits — prevent runaway streams. |
| `cluster.pool_size` | server config | Increase for high cross-route fan-out (default is OS auto). |
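Several of the server-side tunables above live together in the config file. A sketch with illustrative values; note that the JetStream limits are typically spelled `max_memory_store` / `max_file_store` in the config file itself:

```
max_payload: 1MB          # raise with care: affects per-connection buffers
write_deadline: "10s"     # lower to shed slow consumers faster

cluster {
  no_advertise: true      # hide internal IPs from gossip behind an LB
  pool_size: 9            # route pooling (2.10+); raise for cross-route fan-out
}

jetstream {
  store_dir: "/data/jetstream"
  max_memory_store: 2GB
  max_file_store: 200GB
}
```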
Troubleshooting¶
Slow consumer¶
Symptom: the server logs `Slow Consumer Detected` for a subscriber and drops the connection.
Causes: slow downstream processing, an undersized `max_pending`, or a congested NIC.
Fixes:
- Raise the consumer's `pending_msgs` limit.
- Use a pull consumer with explicit `Fetch(n)`.
- Check `nats-top` output for slow consumers and outbound buffer sizes.
- Audit synchronous downstream calls (DB, third-party HTTP) inside callbacks — move them off-thread.
JetStream R3 stream out of sync¶
Symptom: `nats stream info` shows `Catchup` for one peer; replication lag keeps growing.
Causes: a lagging follower, disk I/O contention, or a transient network partition.
Fixes:
- Inspect peer state: `nats stream cluster peer-info <stream>`.
- Force a Raft step-down: `nats stream cluster step-down <stream>` to rebalance.
- If a peer is lost, scale the stream replicas: `nats stream edit --replicas=3 <stream>`.
- Check `store_dir` disk space and `iostat` for per-disk write latency.
Account JWT not reloaded¶
Symptom: account permissions were edited via `nsc edit`, but the server still enforces the old rules.
Cause: the server has not pulled the new JWT.
Fix:
```shell
nsc push -A   # push all accounts to the resolver
# or, with the embedded resolver:
nats account info --account=A
```
Cluster split brain (rare)¶
JetStream uses Raft, so a true split brain is prevented; what looks like split brain is usually two leaf domains pointing at different hubs. Verify via `nats server list --js` from a system-account user.
Cost Analysis¶
| Cost | Driver |
|---|---|
| Compute | Tiny — a 3-node cluster typically fits on three modest VMs. |
| Storage | JetStream sizing dominates; tier old streams via mirror to a cheaper "cold cluster" if you don't need replay throughput. |
| Network egress | Gateway and leaf links are the chief egress drivers. |
| Synadia Cloud | Per-connection + per-data; cheaper than self-hosting at low scale. |
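The egress row is the one worth estimating up front. A back-of-envelope sketch; the replication rate and per-GB price below are made up:

```shell
#!/bin/sh
# Rough gateway egress cost estimate (rate and price are hypothetical)
mb_per_sec=5                 # sustained cross-region replication
seconds_per_month=2592000    # 30 days
cents_per_gb=2               # e.g. $0.02/GB inter-region egress

gb_per_month=$(( mb_per_sec * seconds_per_month / 1024 ))
echo "egress: ${gb_per_month} GB/month"
echo "cost:   \$$(( gb_per_month * cents_per_gb / 100 ))/month"
```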
Commands & Recipes¶
Cluster bootstrap (CLI-driven)¶
```shell
# Spin up an unsecured 3-node JetStream cluster on localhost (dev only)
nats-server --jetstream --cluster nats://0.0.0.0:6222 --cluster_name C1 --routes nats://localhost:6222 --port 4222 --server_name n1 --store_dir /tmp/n1
nats-server --jetstream --cluster nats://0.0.0.0:6223 --cluster_name C1 --routes nats://localhost:6222 --port 4223 --server_name n2 --store_dir /tmp/n2
nats-server --jetstream --cluster nats://0.0.0.0:6224 --cluster_name C1 --routes nats://localhost:6222 --port 4224 --server_name n3 --store_dir /tmp/n3
```
Operator + account bootstrap¶
```shell
nsc add operator -n DEMO --sys
nsc edit operator --account-jwt-server-url nats://localhost:4222
nsc add account -n APP
nsc add user -n service
nsc generate creds -a APP -n service > app-service.creds
nsc push -A
```
Stream + consumer¶
```shell
# Create a 3-replica stream over orders.>
nats stream add ORDERS \
  --subjects "orders.>" --storage file --replicas 3 \
  --retention limits --max-age 720h --max-bytes 50GB \
  --discard old --dupe-window 2m --defaults

# Create a durable pull consumer
nats consumer add ORDERS workers \
  --pull --filter "orders.created.>" \
  --ack explicit --max-deliver 5 --replay instant --defaults
```
KV bucket¶
```shell
# Create a replicated KV bucket
nats kv add SESSIONS --replicas=3 --ttl=24h

# Put / Get
nats kv put SESSIONS user.42 '{"role":"admin"}'
nats kv get SESSIONS user.42

# Watch a key
nats kv watch SESSIONS user.42
```
Object Store¶
```shell
nats object add FIRMWARE --replicas=3
nats object put FIRMWARE ./build/firmware-v1.bin
nats object get FIRMWARE firmware-v1.bin
nats object info FIRMWARE
```
Diagnostics¶
```shell
# Live cluster overview
nats-top -s nats://nats-1:4222

# Detailed server stats
nats server info -s nats://nats-1:4222

# Per-stream / consumer state
nats stream report
nats consumer report ORDERS

# Backup / restore a stream
nats stream backup ORDERS ./orders.tgz
nats stream restore ./orders.tgz

# Bench
nats bench mybench --pub 4 --sub 4 --msgs 1000000 --size 256
```
Helm install (Kubernetes)¶
```shell
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm upgrade --install nats nats/nats \
  --set config.cluster.enabled=true \
  --set config.cluster.replicas=3 \
  --set config.jetstream.enabled=true \
  --set config.jetstream.fileStore.pvc.size=200Gi \
  --set config.jetstream.fileStore.pvc.storageClassName=ssd \
  -n nats --create-namespace
```
Prometheus¶
nats-server exposes monitoring on :8222. Use the prometheus-nats-exporter:
```shell
prometheus-nats-exporter -varz -connz -routez -gatewayz -leafz -channelz -jsz=all http://nats-1:8222
```
Then scrape from Prometheus on :7777.
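A matching scrape job on the Prometheus side might look like the following sketch (job name and target host are assumptions):

```yaml
scrape_configs:
  - job_name: nats
    static_configs:
      - targets: ["nats-exporter:7777"]
```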
NACK (Kubernetes operator)¶
```shell
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/crds.yml
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/rbac.yml
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/jsc.yml
```
Then declare streams and consumers as CRDs (`kind: Stream`, `kind: Consumer`, `kind: KeyValue`, `kind: ObjectStore`).
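For example, a minimal Stream manifest might look like this sketch (field names follow the NACK v1beta2 CRDs; verify against the repo for your version):

```yaml
apiVersion: jetstream.nats.io/v1beta2
kind: Stream
metadata:
  name: orders
spec:
  name: ORDERS
  subjects: ["orders.>"]
  storage: file
  replicas: 3
```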
Upgrade Strategy¶
- Rolling restart — drain leadership with `nats server raft step-down --cluster <name>` before terminating each pod.
- Server version skew — NATS supports N to N-1 mixed mode briefly during rolling upgrades; never run more than two minor versions apart in one cluster.
- JetStream protocol changes — read release notes for any behavior change to consumer or stream APIs.
Cross-references¶
- messaging/nats/architecture — for understanding the Raft groups and storage internals you are operating.
- messaging/nats/security — for hardening checklists used in production.
- messaging/index — domain hub, comparisons with Kafka / RabbitMQ / Redpanda / Pulsar.