
Operations

Production deployment, tuning, troubleshooting, and a Commands & Recipes section for rabbitmqctl, rabbitmq-diagnostics, and the management HTTP API.

Deployment Patterns

Three-node cluster

Three RabbitMQ nodes on separate hosts/AZs, joined into a cluster. The Khepri metadata store runs its Raft group across these three nodes; quorum queues replicate across all three.

Five-node cluster

For higher concurrency, or when running quorum queues at scale, five nodes give more headroom and tolerate two simultaneous failures, at the cost of slightly higher commit latency: each write must be fsynced by a majority of three nodes instead of two.
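The failure-tolerance figures above follow directly from Raft majority arithmetic; a quick illustration in plain shell (no broker required):

```shell
# Raft needs a majority (n/2 + 1) of nodes to commit a write, so a
# cluster of n nodes tolerates n - (n/2 + 1) simultaneous failures.
for n in 3 5 7; do
  majority=$(( n / 2 + 1 ))
  echo "n=$n majority=$majority tolerates=$(( n - majority ))"
done
# n=3 majority=2 tolerates=1
# n=5 majority=3 tolerates=2
# n=7 majority=4 tolerates=3
```

Even node counts are avoided for the same reason: four nodes still need a majority of three, so they tolerate no more failures than three nodes do.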

Kubernetes (Cluster Operator)

Use the RabbitMQ Cluster Operator to declare clusters via CRDs. Pair with the Messaging Topology Operator to declare exchanges/queues/users in YAML.

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: prod
  namespace: rabbit
spec:
  replicas: 3
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
  persistence:
    storage: 200Gi
    storageClassName: ssd
  rabbitmq:
    additionalConfig: |
      cluster_partition_handling = pause_minority
      vm_memory_high_watermark.relative = 0.4
      default_queue_type = quorum

Sizing

Resource guidance:

  • CPU: 4–8 vCPUs per node typical; quorum queues are CPU-intensive due to Raft consensus.
  • Memory: 4–8 GB per node minimum; raise the watermark only after profiling.
  • Disk: NVMe strongly preferred; fsync latency dominates quorum queue throughput.
  • Network: 10 GbE+ for replication; gigabit is fine for low-volume queues.
  • File descriptors: bump nofile to ~64k+ for many connections.
  • Erlang processes: the +P emulator flag; defaults are usually OK.
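On systemd hosts, the nofile bump above is usually applied as a unit drop-in; a minimal sketch, assuming the standard rabbitmq-server unit name:

```ini
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65536
```

After `systemctl daemon-reload` and a restart, `rabbitmq-diagnostics status` reports the effective file descriptor limit, so you can confirm the override took.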

Best Practices

  • Default queue type to quorum (default_queue_type = quorum) for new vhosts.
  • Set x-delivery-limit on quorum queues to prevent poison-message loops.
  • Set prefetch (basic.qos) explicitly — a prefetch of 10–250 per consumer balances throughput and fairness; unbounded prefetch lets one consumer hoard messages.
  • Publisher confirms for any meaningful durability (channel.confirm_select / confirmCallback).
  • Streams over fanout when the consumer count is high and replay is needed.
  • Federate rather than cluster across WANs — clustering over high-latency links is unsupported.
  • One vhost per environment / tenant; don't share namespaces.
  • Enable Prometheus plugin (built-in in 4.x): rabbitmq-plugins enable rabbitmq_prometheus.
  • Connection-per-app, channels-per-thread — anti-pattern is one connection per request.
  • Quorum queues need at least three replicas; set x-quorum-initial-group-size: 3.
  • Run management UI on a dedicated network; don't expose :15672 to the internet.
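Several of the bullets above land in rabbitmq.conf; a sketch of the relevant keys (values are illustrative, not prescriptive):

```ini
# rabbitmq.conf — illustrative values only
default_queue_type = quorum
cluster_partition_handling = pause_minority
vm_memory_high_watermark.relative = 0.4

# The Prometheus endpoint additionally requires:
#   rabbitmq-plugins enable rabbitmq_prometheus
```

Per-queue settings such as x-delivery-limit are applied at declaration time or via policies rather than in rabbitmq.conf.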

Performance Tuning

Key tunables:

  • vm_memory_high_watermark.relative: threshold above which publishers are blocked. Default 0.4.
  • disk_free_limit.relative: publishers are blocked when free disk falls below this fraction of total RAM. The out-of-the-box default is an absolute limit (disk_free_limit.absolute = 50MB); a relative value of at least 1.0 is recommended in production.
  • channel_max: maximum channels per connection. Default 2047.
  • cluster_partition_handling: pause_minority (recommended), autoheal, or ignore. Becomes simpler under Khepri, where Raft handles partitions.
  • default_consumer_prefetch: per-consumer prefetch applied when the consumer does not set one explicitly.
  • loopback_users: restricts the guest user to localhost by default — keep this.
  • tcp_listen_options: adjust nodelay, linger, and send/receive buffers.
  • collect_statistics_interval: lower it if the management UI lags; raise it to reduce CPU on high-cardinality fleets.
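As a concrete sketch, the network- and statistics-related tunables above look like this in rabbitmq.conf (the buffer sizes are illustrative starting points, not recommendations):

```ini
# rabbitmq.conf — TCP and stats tuning sketch, values illustrative
tcp_listen_options.nodelay = true
tcp_listen_options.sndbuf  = 196608
tcp_listen_options.recbuf  = 196608
channel_max = 512
collect_statistics_interval = 30000   # milliseconds
```

Larger socket buffers raise per-connection memory cost, so tune them together with the expected connection count.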

Troubleshooting

Memory alarm — producers blocked

Symptom: publishers report connection.blocked; mgmt UI shows red node.

Causes: queue backlog, large in-memory classic queues, big mgmt-UI history retention.

Fixes:

  • rabbitmq-diagnostics memory_breakdown to identify the dominant consumer.
  • Drain a queue, or migrate it to quorum (messages stay on disk).
  • Lower collect_statistics_interval retention.
  • Raise the watermark only as a temporary measure.
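Two of those relief valves can be expressed in rabbitmq.conf; a sketch (both are trade-offs, not permanent fixes):

```ini
# rabbitmq.conf — temporary relief valves, values illustrative
vm_memory_high_watermark.relative = 0.5   # temporary only; revert after draining
management.rates_mode = none              # drop per-object rate stats to cut mgmt memory
```

management.rates_mode = none disables rate graphs in the management UI, so re-enable it (basic) once the alarm clears.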

Disk free alarm

rabbitmq-diagnostics check_alarms
df -h /var/lib/rabbitmq

Trim a stream's max_age or max_segment_size, or expand the volume.

Quorum queue stuck (no leader)

Symptom: rabbitmq-queues quorum_status NAME shows no leader.

Cause: insufficient nodes for Raft quorum (e.g. 1 of 3 reachable).

Fix: restore network connectivity, or, in last-resort recovery, rabbitmq-queues delete_member and re-add a fresh node. Avoid force_reset unless you're aware of the data-loss implications.

Slow consumer back-pressure

Use rabbitmqctl list_queues messages messages_ready messages_unacknowledged consumers consumer_capacity to find queues with low capacity. Low consumer capacity + high unacked = slow consumer.
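The capacity filter can be scripted with awk; the sample data below is hypothetical, standing in for real `rabbitmqctl list_queues` output:

```shell
# Columns: name messages messages_ready messages_unacked consumers consumer_capacity
# Hypothetical sample; in practice pipe in:
#   rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers consumer_capacity --quiet
slow=$(printf 'orders 12000 11000 1000 5 0.12\nevents 10 10 0 2 0.98\n' |
  awk '$6 < 0.5 { print $1, "capacity=" $6 }')
echo "$slow"   # → orders capacity=0.12
```

A capacity threshold of 0.5 is an arbitrary illustration; pick one that matches your baseline.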

Khepri membership drift

rabbitmqctl status                   # check Khepri members
rabbitmqctl forget_cluster_node NODE # remove a permanently dead node

MQTT 5 / WebSocket clients disconnecting

  • Check that rabbitmq_mqtt plugin version matches server.
  • Verify mqtt.listeners.tcp and mqtt.listeners.ssl are enabled.
  • For mass disconnects after upgrade, look at the client library's MQTT 5 vs 3.1.1 default — RabbitMQ supports both but config is per-listener.
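The listener checks above correspond to these rabbitmq.conf keys; a sketch with the conventional ports (assumes the rabbitmq_mqtt and rabbitmq_web_mqtt plugins are enabled):

```ini
# rabbitmq.conf — MQTT/WebSocket listener sketch, ports illustrative
mqtt.listeners.tcp.default = 1883
mqtt.listeners.ssl.default = 8883
web_mqtt.tcp.port = 15675
```

If a listener key is absent, confirm the corresponding plugin is enabled before debugging the client side.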

Cost Analysis

Cost drivers:

  • Compute: the Erlang VM's footprint is moderate, but not negligible at idle.
  • Storage: quorum queue WAL fsyncs produce bursty writes; provision NVMe.
  • Memory: classic queues hold messages in memory until paged out; quorum queues spool to disk by default.
  • Network egress: Federation/Shovel cross-region links carry duplicate traffic.
  • Tanzu RabbitMQ: per-core licensing; sometimes cheaper than running ops yourself.
  • CloudAMQP: per-instance pricing scales linearly with throughput class.

Commands & Recipes

Bootstrap & cluster

# On node 1
rabbitmqctl status

# On node 2 — join node 1's cluster
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app

Vhost & user setup

rabbitmqctl add_vhost prod --default-queue-type quorum
rabbitmqctl add_user app 'change-me-now'
rabbitmqctl set_user_tags app monitoring
rabbitmqctl set_permissions -p prod app '.*' '.*' '.*'

# OAuth 2.0 plugin (replace local users)
rabbitmq-plugins enable rabbitmq_auth_backend_oauth2

Declare a quorum queue + binding

rabbitmqadmin declare queue name=orders queue_type=quorum durable=true \
  arguments='{"x-delivery-limit":5,"x-max-length":1000000}'
rabbitmqadmin declare exchange name=orders.x type=topic durable=true
rabbitmqadmin declare binding source=orders.x destination=orders routing_key=orders.created.*

Declare a stream

rabbitmqadmin declare queue name=events queue_type=stream durable=true \
  arguments='{"x-max-length-bytes":50000000000,"x-stream-max-segment-size-bytes":500000000}'

Federation upstream

rabbitmqctl set_parameter federation-upstream us-prod \
  '{"uri":"amqps://app:[email protected]:5671","trust-user-id":true}'
rabbitmqctl set_policy federate-orders "^orders\." \
  '{"federation-upstream-set":"all"}' --apply-to exchanges

Diagnostics

rabbitmq-diagnostics status
rabbitmq-diagnostics memory_breakdown
rabbitmq-diagnostics check_alarms
rabbitmq-diagnostics check_running
rabbitmq-diagnostics observer            # Erlang interactive observer
rabbitmqctl list_queues name type messages messages_ready consumers
rabbitmqctl list_connections user host channels state
rabbitmq-queues quorum_status orders
rabbitmq-queues stream_status events

Prometheus + Grafana

rabbitmq-plugins enable rabbitmq_prometheus
curl http://node:15692/metrics       # default endpoint

Apply the official Grafana dashboards.

perf-test

docker run -it --rm pivotalrabbitmq/perf-test:latest \
  --uri amqp://app:pwd@host:5672 \
  --producers 10 --consumers 10 \
  --rate 10000 --confirm 100 \
  --queue-pattern 'q-%d' --queue-pattern-from 1 --queue-pattern-to 50 \
  --quorum-queue

Upgrade Strategy

  • Rolling upgrade within a minor (4.2.0 → 4.2.5): update node-by-node, check rabbitmq-diagnostics check_running before moving on.
  • Mixed-mode tolerance is N → N+1 minor only; avoid running 4.0 next to 4.2 in the same cluster.
  • Khepri migration: when migrating from Mnesia, first confirm all nodes run 4.0+ and all stable feature flags are enabled, then run rabbitmqctl enable_feature_flag khepri_db. Once enabled, downgrading requires a backup and restore.
  • Plugin compatibility: check the plugin's release notes before upgrading.
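The node-by-node health gate can be sketched as a small retry loop; `wait_healthy true` below uses the shell builtin `true` as a stub standing in for the real `rabbitmq-diagnostics check_running` call:

```shell
# Retry a health check up to 5 times before moving on to the next node.
wait_healthy() {
  for attempt in 1 2 3 4 5; do
    if "$@"; then echo healthy; return 0; fi
    sleep 1
  done
  echo unhealthy; return 1
}

wait_healthy true   # stub; real use: wait_healthy rabbitmq-diagnostics check_running
# prints: healthy
```

Gating on the check's exit code (rather than eyeballing output) is what makes the rolling upgrade safe to automate.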

Cross-references