Operations

Production guidance for running NATS Server and JetStream — sizing, deployment, tuning, troubleshooting, and a Commands & Recipes section with the nats, nsc, and nats-top CLIs.

Deployment Patterns

Single cluster (3 or 5 nodes)

The default footprint: an odd number of nats-servers in one cluster, full-mesh routes on port 6222, JetStream enabled, R3 streams.
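A minimal per-node server config for this layout might look like the following sketch; the node names, hostnames, and paths are illustrative, and each node gets its own server_name and store_dir:

```
# n1.conf (hypothetical; repeat per node with its own name and store_dir)
server_name: n1
port: 4222

jetstream {
  store_dir: /data/jetstream
}

cluster {
  name: C1
  port: 6222
  routes: [
    nats://n1:6222
    nats://n2:6222
    nats://n3:6222
  ]
}
```

Listing all three seed routes on every node lets any node bootstrap the full mesh regardless of start order.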

flowchart LR
    n1["n1\nnats-server"]
    n2["n2\nnats-server"]
    n3["n3\nnats-server"]
    n1 <--> n2
    n2 <--> n3
    n1 <--> n3
    Clients["Apps / Microservices"]
    Clients --> n1
    Clients --> n2
    Clients --> n3

Supercluster (multi-region)

Two or more clusters connected by gateways. Useful when latency or compliance requires regional isolation but a unified subject space is wanted.
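Gateways are declared per cluster: each side names itself and lists the remote clusters it should dial. A sketch, with cluster names, the 7222 port choice, and hostnames as assumptions:

```
# on every server in cluster "east"
gateway {
  name: east
  port: 7222
  gateways: [
    { name: west, url: "nats://west-gw.example.com:7222" }
  ]
}
```

The mirror config (name: west, remote: east) goes on the other cluster; gossip then propagates additional gateway endpoints.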

Hub + leaf nodes (edge / SaaS tenant)

Central hub cluster runs JetStream. Leaf nodes at the edge (factories, branches, customer sites) connect outbound on TCP 7422. Each leaf is account-scoped — the leaf operator decides what subjects to expose.
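In config terms, the hub listens for leaf connections and each leaf dials out; hostnames and the credentials path below are illustrative:

```
# hub: accept leaf connections
leafnodes {
  port: 7422
}

# leaf node at the edge: dial outbound to the hub
leafnodes {
  remotes: [
    { url: "nats://hub.example.com:7422", credentials: "/etc/nats/leaf.creds" }
  ]
}
```

Because the leaf dials outbound, no inbound firewall holes are needed at the edge site.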

Kubernetes (Helm chart)

Use the official Helm chart at github.com/nats-io/k8s. The pattern:

  • StatefulSet with persistent volumes for store_dir.
  • Headless service for cluster routing.
  • HPA only for non-JetStream workloads — never autoscale a JetStream-cluster member.

Sizing

  • CPU: 4–8 vCPUs per server is typical; Core NATS is rarely CPU-bound.
  • Memory: 8 GB baseline; raise for large interest graphs and JetStream caches.
  • Disk: NVMe for the JetStream store_dir; watch fsync latency.
  • Network: 10 GbE+ for high-throughput JetStream; gateways tolerate higher latency.
  • Pods/StatefulSet: always odd counts (3, 5) for Raft.

Best Practices

  • Subject design. Use a stable token order and avoid PII in subject tokens (subjects travel in cleartext in the wire protocol unless the connection uses TLS). Prefer hierarchical: region.tenant.entity.action.id.
  • Retention. Pick exactly one of max_age, max_bytes, max_msgs as the dominant constraint per stream; the others should be safety nets.
  • Replicas. R1 for development and edge; R3 for prod; R5 only for high-fan-in commands or critical KV.
  • Pull consumers for load-balanced backends; push consumers only for simple, low-rate delivery with small fan-out.
  • System account ops user. Always create a dedicated sys.creds for monitoring tools. Do not share with normal apps.
  • Account isolation. One account per trust boundary (tenant, environment, team). Use exports/imports for the few flows that legitimately cross.
  • Backups. nats stream backup for individual streams; back up the JWT account tree and signing keys separately by archiving the nsc stores and keys directories (paths shown by nsc env).
  • Keep JetStream enablement consistent across a cluster rather than mixing JetStream and Core-only nodes ad hoc; a uniform topology keeps Raft placement predictable.
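The account-isolation and export/import practice can be sketched in server config; the account names, user names, and subject are illustrative:

```
accounts {
  BILLING {
    users: [ { user: billing, password: $BILLING_PW } ]
    exports: [ { service: "billing.invoice.create" } ]
  }
  SHOP {
    users: [ { user: shop, password: $SHOP_PW } ]
    imports: [
      { service: { account: BILLING, subject: "billing.invoice.create" } }
    ]
  }
}
```

Everything else stays isolated by default; only the explicitly exported service subject crosses the account boundary.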

Performance Tuning

  • max_payload (server config): default 1 MB. Raise carefully; it affects buffer sizing on every connection.
  • write_deadline (server config): default 10s; lower if you'd rather drop slow consumers fast.
  • max_pending, per consumer (client): backpressure threshold for push subscribers.
  • prevent_gateway_dial_through_routes (server config): forces gateway membership only via gossip; sometimes useful for predictable topology.
  • cluster.no_advertise (server config): hide internal IPs from gossip when behind a load balancer.
  • cipher_suites, TLS (server config): restrict to AEAD ciphers only.
  • jetstream.max_memory_store / jetstream.max_file_store (server config): per-server JetStream limits; prevent runaway streams.
  • cluster.pool_size (server config): increase for high cross-route fan-out.
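Several of these land in the same server config file; a sketch with illustrative values (the pool_size choice in particular is a hypothetical, workload-dependent number):

```
max_payload: 1MB
write_deadline: "10s"

jetstream {
  max_memory_store: 2GB
  max_file_store: 200GB
}

cluster {
  no_advertise: true
  pool_size: 8    # hypothetical; tune against cross-route fan-out
}
```

All of these apply per server, so keep them identical across the cluster to avoid asymmetric behavior.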

Troubleshooting

Slow consumer

Symptom: server logs Slow Consumer Detected on a subscriber; the connection is dropped.

Causes: slow downstream processing, undersized max_pending, or congested NIC.

Fixes:

  • Raise the consumer's pending limits (max_pending / PendingLimits).
  • Use a pull consumer with explicit Fetch(n).
  • Check nats-top output for slow consumers and outbound buffer sizes.
  • Audit synchronous downstream calls (DB, third-party HTTP) inside callbacks and move them off-thread.
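The pull pattern can be approximated from the CLI for a quick test; the stream/consumer names match the recipes later in this page, and the exact flags should be checked against your nats CLI version:

```
# fetch an explicit batch instead of relying on push delivery
nats consumer next ORDERS workers --count 10

# check outstanding acks and pending counts on the consumer
nats consumer info ORDERS workers
```

If throughput is fine here but the app still falls behind, the bottleneck is almost certainly inside the message callback, not the server.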

JetStream R3 stream out of sync

Symptom: nats stream info shows Catchup for one peer; replication lag growing.

Causes: lagging follower, disk I/O contention, or transient network partition.

Fixes:

  • Inspect peer state in the cluster section of nats stream info <stream>.
  • Force a leader election to rebalance: nats stream cluster step-down <stream>.
  • If a peer is permanently lost, remove it so a replacement is assigned: nats stream cluster peer-remove <stream> <peer>.
  • Check store_dir disk space and iostat for per-disk write latency.

Account JWT not reloaded

Symptom: edited account permissions via nsc edit but server still enforces old rules.

Cause: server has not pulled the new JWT.

Fix:

nsc push -A           # push all accounts to the resolver
# with a config-embedded (memory) resolver, update the preloaded JWT
# in the server config and reload instead:
nats-server --signal reload
# then verify the server sees the new claims:
nats account info

Cluster split brain (rare)

JetStream uses Raft so a true split brain is prevented; what looks like split-brain is usually two leaf domains pointing at different hubs. Verify via nats server list --js from a system-account user.

Cost Analysis

  • Compute: tiny; a 3-node cluster typically fits on three modest VMs.
  • Storage: JetStream sizing dominates; tier old streams via a mirror to a cheaper "cold cluster" if you don't need replay throughput.
  • Network egress: gateway and leaf links are the chief egress drivers.
  • Synadia Cloud: per-connection plus per-data pricing; cheaper than self-hosting at low scale.
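The mirror-tiering idea can be sketched with the CLI; the ORDERS_ARCHIVE name and the "cold" context (pointing at the cheaper cluster) are assumptions, and cross-cluster mirrors additionally need source/external settings matching your topology:

```
# on the cold cluster: a cheap single-replica, long-retention copy
nats --context cold stream add ORDERS_ARCHIVE \
  --mirror ORDERS --storage file --replicas 1 \
  --max-age 8760h --defaults
```

Reads for replay then hit the archive stream, keeping the hot R3 stream small and fast.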

Commands & Recipes

Cluster bootstrap (CLI-driven)

# Spin up an unsecured 3-node JetStream cluster on localhost (dev only)
nats-server --jetstream --cluster nats://0.0.0.0:6222 --cluster_name C1 --routes nats://localhost:6222 --port 4222 --server_name n1 --store_dir /tmp/n1
nats-server --jetstream --cluster nats://0.0.0.0:6223 --cluster_name C1 --routes nats://localhost:6222 --port 4223 --server_name n2 --store_dir /tmp/n2
nats-server --jetstream --cluster nats://0.0.0.0:6224 --cluster_name C1 --routes nats://localhost:6222 --port 4224 --server_name n3 --store_dir /tmp/n3

Operator + account bootstrap

nsc add operator -n DEMO --sys
nsc edit operator --account-jwt-server-url nats://localhost:4222
nsc add account -n APP
nsc add user -n service
nsc generate creds -a APP -n service > app-service.creds
nsc push -A
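For the server side of this bootstrap, nsc can emit a resolver configuration that the servers include; the output filename is illustrative:

```
# generate a NATS-based resolver config (operator JWT, system account, resolver block)
nsc generate config --nats-resolver > resolver.conf
# then reference it from the server config:
#   include ./resolver.conf
```

After the servers load this config, nsc push -A has a resolver to push account JWTs into.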

Stream + consumer

# Create a 3-replica stream over orders.>
nats stream add ORDERS \
  --subjects "orders.>" --storage file --replicas 3 \
  --retention limits --max-age 720h --max-bytes 50GB \
  --discard old --dupe-window 2m --defaults

# Create a durable pull consumer
nats consumer add ORDERS workers \
  --pull --filter "orders.created.>" \
  --ack explicit --max-deliver 5 --replay instant --defaults

KV bucket

# Create a replicated KV bucket
nats kv add SESSIONS --replicas=3 --ttl=24h
# Put / Get
nats kv put SESSIONS user.42 '{"role":"admin"}'
nats kv get SESSIONS user.42
# Watch a key
nats kv watch SESSIONS user.42

Object Store

nats object add FIRMWARE --replicas=3
nats object put FIRMWARE ./build/firmware-v1.bin
nats object get FIRMWARE firmware-v1.bin
nats object info FIRMWARE

Diagnostics

# Live cluster overview (nats-top talks to the monitoring port, not the client port)
nats-top -s nats-1 -m 8222

# Detailed server stats
nats server info -s nats://nats-1:4222

# Per-stream / consumer state
nats stream report
nats consumer report ORDERS

# Backup / restore a stream
nats stream backup ORDERS ./orders.tgz
nats stream restore ./orders.tgz

# Bench
nats bench mybench --pub 4 --sub 4 --msgs 1000000 --size 256

Helm install (Kubernetes)

helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm upgrade --install nats nats/nats \
  --set config.cluster.enabled=true \
  --set config.cluster.replicas=3 \
  --set config.jetstream.enabled=true \
  --set config.jetstream.fileStore.pvc.size=200Gi \
  --set config.jetstream.fileStore.pvc.storageClassName=ssd \
  -n nats --create-namespace

Prometheus

nats-server exposes monitoring on :8222. Use the prometheus-nats-exporter:

prometheus-nats-exporter -varz -connz -routez -gatewayz -leafz -jsz=all http://nats-1:8222

Then scrape from Prometheus on :7777.
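A matching Prometheus scrape job might look like this; the job name and exporter hostname are assumptions:

```
scrape_configs:
  - job_name: nats
    static_configs:
      - targets: ["nats-exporter:7777"]
```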

NACK (Kubernetes operator)

kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/crds.yml
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/rbac.yml
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/jsc.yml

Then declare streams and consumers as CRDs (kind: Stream, kind: Consumer, kind: KeyValue, kind: ObjectStore).

Upgrade Strategy

  • Rolling restart — drain leadership with nats server raft step-down --cluster <name> before terminating each pod.
  • Server version skew — NATS supports N to N-1 mixed mode briefly during rolling upgrades; never run more than two minor versions apart in one cluster.
  • JetStream protocol changes — read release notes for any behavior change to consumer or stream APIs.

Cross-references