Operations¶
Production guidance for running NATS Server and JetStream — sizing, deployment, tuning, troubleshooting, and a Commands & Recipes section with the nats, nsc, and nats-top CLIs.
Deployment Patterns¶
Single cluster (3 or 5 nodes)¶
The default footprint: an odd number of nats-servers in one cluster, full-mesh routes on port 6222, JetStream enabled, R3 streams.
```mermaid
flowchart LR
    n1["n1\nnats-server"]
    n2["n2\nnats-server"]
    n3["n3\nnats-server"]
    n1 <--> n2
    n2 <--> n3
    n1 <--> n3
    Clients["Apps / Microservices"]
    Clients --> n1
    Clients --> n2
    Clients --> n3
```
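The same topology in config-file form (the Commands & Recipes section below uses CLI flags instead); hostnames and paths here are illustrative:

```
server_name: n1
port: 4222

jetstream {
  store_dir: "/data/jetstream"
}

cluster {
  name: C1
  port: 6222
  # full-mesh routes to the other members
  routes = [ nats://n2:6222, nats://n3:6222 ]
}
```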
Supercluster (multi-region)¶
Two or more clusters connected by gateways. Useful when latency or compliance requires regional isolation but a unified subject space is wanted.
Hub + leaf nodes (edge / SaaS tenant)¶
Central hub cluster runs JetStream. Leaf nodes at the edge (factories, branches, customer sites) connect outbound on TCP 7422. Each leaf is account-scoped — the leaf operator decides what subjects to expose.
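In configuration terms this is two small fragments; a sketch, with the hub hostname and credentials path as assumptions:

```
# hub cluster: accept inbound leaf connections
leafnodes {
  port: 7422
}

# edge leaf node: dial out to the hub with account-scoped creds
leafnodes {
  remotes = [
    { url: "nats://hub.example.com:7422", credentials: "/etc/nats/leaf.creds" }
  ]
}
```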
Kubernetes (Helm chart)¶
Use the official Helm chart at github.com/nats-io/k8s. The pattern:
- StatefulSet with persistent volumes for `store_dir`.
- Headless service for cluster routing.
- HPA only for non-JetStream workloads — never autoscale a JetStream cluster member.
Sizing¶
| Resource | Guidance |
|---|---|
| CPU | 4–8 vCPUs per server is typical; Core NATS is rarely CPU-bound. |
| Memory | 8 GB baseline; raise for large interest graphs and JetStream caches. |
| Disk | NVMe for JetStream store_dir; watch fsync latency. |
| Network | 10 GbE+ for high-throughput JetStream; gateways tolerate higher latency. |
| Pods/StatefulSet | Always odd counts (3, 5) for Raft. |
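The disk row can be turned into a back-of-envelope calculation. The workload numbers below are made up; plug in your own rates:

```shell
#!/bin/sh
# Rough JetStream file-store sizing (all workload numbers are hypothetical)
msgs_per_sec=2000
avg_msg_bytes=1024
retention_hours=720          # e.g. a stream with max_age 720h
replicas=3                   # R3: every replica stores a full copy

bytes_per_replica=$(( msgs_per_sec * avg_msg_bytes * retention_hours * 3600 ))
bytes_cluster=$(( bytes_per_replica * replicas ))

echo "per-replica:  $(( bytes_per_replica / 1024 / 1024 / 1024 )) GiB"
echo "cluster-wide: $(( bytes_cluster / 1024 / 1024 / 1024 )) GiB"
```

Add headroom on top of this figure: deletes are lazy, and catch-up traffic after a peer restart temporarily increases disk usage.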
Best Practices¶
- Subject design. Use a stable token order and avoid PII in subject tokens (subjects are not encrypted in the wire protocol metadata). Prefer a hierarchical scheme: `region.tenant.entity.action.id`.
- Retention. Pick exactly one of `max_age`, `max_bytes`, `max_msgs` as the dominant constraint per stream; the others should be safety nets.
- Replicas. R1 for development and edge; R3 for production; R5 only for high-fan-in commands or critical KV.
- Consumers. Pull consumers for load-balanced backends; push consumers for low-latency fan-out.
- System account ops user. Always create a dedicated `sys.creds` for monitoring tools. Do not share it with normal apps.
- Account isolation. One account per trust boundary (tenant, environment, team). Use exports/imports for the few flows that legitimately cross boundaries.
- Backups. `nats stream backup` for individual streams; back up the JWT account tree separately via `nsc list keys`.
- Homogeneous clusters. Do not mix JetStream-enabled and Core-only nodes inconsistently in one cluster; a uniform topology keeps Raft assignment predictable.
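Account isolation with exports/imports can also be expressed directly in a memory-resolver server config; a sketch where the account names, users, and subjects are all hypothetical:

```
accounts {
  TENANT_A: {
    users: [ { user: "tenant-a-svc", password: "changeme" } ]
    # expose only this tenant's order events
    exports: [ { stream: "orders.tenant-a.>" } ]
  }
  BILLING: {
    users: [ { user: "billing-svc", password: "changeme" } ]
    # the one flow that legitimately crosses the boundary
    imports: [ { stream: { account: TENANT_A, subject: "orders.tenant-a.>" } } ]
  }
}
```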
Performance Tuning¶
| Tunable | Where | Notes |
|---|---|---|
| `max_payload` | server config | Default 1 MB. Raise carefully — affects buffer sizing on every connection. |
| `write_deadline` | server config | Default 10s; lower if you'd rather drop slow consumers fast. |
| `max_pending` (per consumer) | client | Backpressure threshold for push subscribers. |
| `prevent_gateway_dial_through_routes` | server config | Forces gateway membership only via gossip — sometimes useful for predictable topology. |
| `cluster.no_advertise` | server config | Hide internal IPs from gossip when behind a load balancer. |
| `cipher_suites` (TLS) | server config | Restrict to AEAD ciphers only. |
| `jetstream.max_memory` / `jetstream.max_file` | server config | Per-server JetStream limits — prevent runaway streams. |
| `cluster.pool_size` | server config | Increase for high cross-route fan-out (default is OS auto). |
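Several of the server-side tunables above live together in the config file. A sketch with illustrative values; note that the JetStream limits are typically spelled `max_memory_store` / `max_file_store` in the config file itself:

```
max_payload: 1MB          # raise with care: affects per-connection buffers
write_deadline: "10s"     # lower to shed slow consumers faster

cluster {
  no_advertise: true      # hide internal IPs from gossip behind an LB
  pool_size: 9            # route pooling (2.10+); raise for cross-route fan-out
}

jetstream {
  store_dir: "/data/jetstream"
  max_memory_store: 2GB
  max_file_store: 200GB
}
```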
Troubleshooting¶
Slow consumer¶
Symptom: the server logs `Slow Consumer Detected` for a subscriber and drops the connection.
Causes: slow downstream processing, an undersized `max_pending`, or a congested NIC.
Fixes:
- Raise the consumer's `pending_msgs` limit.
- Use a pull consumer with explicit `Fetch(n)`.
- Check `nats-top` output for slow consumers and outbound buffer sizes.
- Audit synchronous downstream calls (DB, third-party HTTP) inside callbacks — move them off-thread.
JetStream R3 stream out of sync¶
Symptom: `nats stream info` shows `Catchup` for one peer; replication lag keeps growing.
Causes: a lagging follower, disk I/O contention, or a transient network partition.
Fixes:
- Inspect peer state: `nats stream cluster peer-info <stream>`.
- Force a Raft step-down: `nats stream cluster step-down <stream>` to rebalance.
- If a peer is lost, scale the stream replicas: `nats stream edit --replicas=3 <stream>`.
- Check `store_dir` disk space and `iostat` for per-disk write latency.
Account JWT not reloaded¶
Symptom: account permissions were edited via `nsc edit`, but the server still enforces the old rules.
Cause: the server has not pulled the new JWT.
Fix:
```shell
nsc push -A   # push all accounts to the resolver
# or, with the embedded resolver:
nats account info --account=A
```
Cluster split brain (rare)¶
JetStream uses Raft, so a true split brain is prevented; what looks like split brain is usually two leaf domains pointing at different hubs. Verify via `nats server list --js` from a system-account user.
Cost Analysis¶
| Cost | Driver |
|---|---|
| Compute | Tiny — a 3-node cluster typically fits on three modest VMs. |
| Storage | JetStream sizing dominates; tier old streams via mirror to a cheaper "cold cluster" if you don't need replay throughput. |
| Network egress | Gateway and leaf links are the chief egress drivers. |
| Synadia Cloud | Per-connection + per-data; cheaper than self-hosting at low scale. |
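The egress row is the one worth estimating up front. A back-of-envelope sketch; the replication rate and per-GB price below are made up:

```shell
#!/bin/sh
# Rough gateway egress cost estimate (rate and price are hypothetical)
mb_per_sec=5                 # sustained cross-region replication
seconds_per_month=2592000    # 30 days
cents_per_gb=2               # e.g. $0.02/GB inter-region egress

gb_per_month=$(( mb_per_sec * seconds_per_month / 1024 ))
echo "egress: ${gb_per_month} GB/month"
echo "cost:   \$$(( gb_per_month * cents_per_gb / 100 ))/month"
```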
Commands & Recipes¶
Cluster bootstrap (CLI-driven)¶
```shell
# Spin up an unsecured 3-node JetStream cluster on localhost (dev only)
nats-server --jetstream --cluster nats://0.0.0.0:6222 --cluster_name C1 --routes nats://localhost:6222 --port 4222 --server_name n1 --store_dir /tmp/n1
nats-server --jetstream --cluster nats://0.0.0.0:6223 --cluster_name C1 --routes nats://localhost:6222 --port 4223 --server_name n2 --store_dir /tmp/n2
nats-server --jetstream --cluster nats://0.0.0.0:6224 --cluster_name C1 --routes nats://localhost:6222 --port 4224 --server_name n3 --store_dir /tmp/n3
```
Operator + account bootstrap¶
```shell
nsc add operator -n DEMO --sys
nsc edit operator --account-jwt-server-url nats://localhost:4222
nsc add account -n APP
nsc add user -n service
nsc generate creds -a APP -n service > app-service.creds
nsc push -A
```
Stream + consumer¶
```shell
# Create a 3-replica stream over orders.>
nats stream add ORDERS \
  --subjects "orders.>" --storage file --replicas 3 \
  --retention limits --max-age 720h --max-bytes 50GB \
  --discard old --dupe-window 2m --defaults

# Create a durable pull consumer
nats consumer add ORDERS workers \
  --pull --filter "orders.created.>" \
  --ack explicit --max-deliver 5 --replay instant --defaults
```
KV bucket¶
```shell
# Create a replicated KV bucket
nats kv add SESSIONS --replicas=3 --ttl=24h

# Put / Get
nats kv put SESSIONS user.42 '{"role":"admin"}'
nats kv get SESSIONS user.42

# Watch a key
nats kv watch SESSIONS user.42
```
Object Store¶
```shell
nats object add FIRMWARE --replicas=3
nats object put FIRMWARE ./build/firmware-v1.bin
nats object get FIRMWARE firmware-v1.bin
nats object info FIRMWARE
```
Diagnostics¶
```shell
# Live cluster overview
nats-top -s nats://nats-1:4222

# Detailed server stats
nats server info -s nats://nats-1:4222

# Per-stream / consumer state
nats stream report
nats consumer report ORDERS

# Backup / restore a stream
nats stream backup ORDERS ./orders.tgz
nats stream restore ./orders.tgz

# Bench
nats bench mybench --pub 4 --sub 4 --msgs 1000000 --size 256
```
Helm install (Kubernetes)¶
```shell
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm upgrade --install nats nats/nats \
  --set config.cluster.enabled=true \
  --set config.cluster.replicas=3 \
  --set config.jetstream.enabled=true \
  --set config.jetstream.fileStore.pvc.size=200Gi \
  --set config.jetstream.fileStore.pvc.storageClassName=ssd \
  -n nats --create-namespace
```
Prometheus¶
nats-server exposes monitoring on :8222. Use the prometheus-nats-exporter:
```shell
prometheus-nats-exporter -varz -connz -routez -gatewayz -leafz -channelz -jsz=all http://nats-1:8222
```
Then scrape from Prometheus on :7777.
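A matching scrape job on the Prometheus side might look like the following sketch (job name and target host are assumptions):

```yaml
scrape_configs:
  - job_name: nats
    static_configs:
      - targets: ["nats-exporter:7777"]
```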
NACK (Kubernetes operator)¶
```shell
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/crds.yml
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/rbac.yml
kubectl apply -f https://raw.githubusercontent.com/nats-io/nack/main/deploy/jsc.yml
```
Then declare streams and consumers as CRDs (`kind: Stream`, `kind: Consumer`, `kind: KeyValue`, `kind: ObjectStore`).
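For example, a minimal Stream manifest might look like this sketch (field names follow the NACK v1beta2 CRDs; verify against the repo for your version):

```yaml
apiVersion: jetstream.nats.io/v1beta2
kind: Stream
metadata:
  name: orders
spec:
  name: ORDERS
  subjects: ["orders.>"]
  storage: file
  replicas: 3
```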
Upgrade Strategy¶
- Rolling restart — drain leadership with `nats server raft step-down --cluster <name>` before terminating each pod.
- Server version skew — NATS supports N to N-1 mixed mode briefly during rolling upgrades; never run more than two minor versions apart in one cluster.
- JetStream protocol changes — read release notes for any behavior change to consumer or stream APIs.
Cross-references¶
- messaging/nats/architecture — for understanding the Raft groups and storage internals you are operating.
- messaging/nats/security — for hardening checklists used in production.
- messaging/index — domain hub, comparisons with Kafka / RabbitMQ / Redpanda / Pulsar.