Operations¶
Production deployment, tuning, troubleshooting, and a Commands & Recipes section for rabbitmqctl, rabbitmq-diagnostics, and the management HTTP API.
Deployment Patterns¶
Three-node cluster (recommended baseline)¶
Three RabbitMQ nodes on separate hosts/AZs, joined into a cluster. The Khepri metadata store runs its Raft group across these three nodes, and quorum queues replicate across all three.
Five-node cluster¶
For higher concurrency or when running quorum queues at scale, five nodes give more headroom and tolerate two failures, at the cost of a slightly larger write quorum (three of five replicas must fsync each commit).
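The fault-tolerance claims follow directly from Raft's majority rule; a quick sketch of the arithmetic:

```shell
# Raft commits once a majority of members acknowledge; the rest can fail.
for n in 3 5 7; do
  majority=$(( n / 2 + 1 ))
  tolerated=$(( n - majority ))
  echo "cluster of $n: majority=$majority, tolerates $tolerated node failures"
done
```

Note that even-sized clusters buy nothing: a cluster of 4 still tolerates only one failure, which is why 3 and 5 are the recommended sizes.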
Kubernetes (Cluster Operator)¶
Use the RabbitMQ Cluster Operator to declare clusters via CRDs. Pair with the Messaging Topology Operator to declare exchanges/queues/users in YAML.
```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: prod
  namespace: rabbit
spec:
  replicas: 3
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
  persistence:
    storage: 200Gi
    storageClassName: ssd
  rabbitmq:
    additionalConfig: |
      cluster_partition_handling = pause_minority
      vm_memory_high_watermark.relative = 0.4
      default_queue_type = quorum
```
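With the Messaging Topology Operator installed alongside, queues can be declared the same way; a sketch of a quorum queue targeting the cluster above (queue name and vhost are illustrative):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: orders
  namespace: rabbit
spec:
  name: orders        # queue name inside RabbitMQ
  vhost: prod
  type: quorum
  durable: true
  rabbitmqClusterReference:
    name: prod        # must match the RabbitmqCluster above
```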
Sizing¶
| Resource | Guidance |
|---|---|
| CPU | 4–8 vCPUs per node typical; quorum queues are CPU-intensive on Raft |
| Memory | 4–8 GB per node minimum; raise watermark only after profiling |
| Disk | NVMe — fsync latency dominates quorum queue throughput |
| Network | 10 GbE+ for replication; gigabit is fine for low-volume queues |
| File descriptors | Bump nofile to ~64k+ for many connections |
| Erlang procs | +P flag — defaults are usually OK |
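When RabbitMQ runs under systemd, the `nofile` bump from the table is typically applied with a drop-in unit override; a minimal sketch:

```ini
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65536
```

Run `systemctl daemon-reload` and restart the node, then confirm the new limit in the `rabbitmq-diagnostics status` output.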
Best Practices¶
- Default new vhosts to quorum queues (`default_queue_type = quorum`).
- Set `x-delivery-limit` on quorum queues to prevent poison-message loops.
- Use prefetch (`basic.qos`) generously; a prefetch of 10–250 per consumer balances throughput and fairness.
- Use publisher confirms (`channel.confirm_select` / `confirmCallback`) for any meaningful durability.
- Prefer streams over fanout when the consumer count is high and replay is needed.
- Use federation, not clustering, across a WAN; clustering over high-latency links is unsupported.
- One vhost per environment/tenant; don't share namespaces.
- Enable the Prometheus plugin (built-in in 4.x): `rabbitmq-plugins enable rabbitmq_prometheus`.
- One connection per app, channels per thread; one connection per request is an anti-pattern.
- Give quorum queues at least three members (`x-quorum-initial-group-size: 3`).
- Run the management UI on a dedicated network; don't expose `:15672` to the internet.
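The `x-delivery-limit` recommendation can also be enforced fleet-wide with a policy rather than per-declaration arguments; a sketch as it would appear in an exported definitions file (vhost and policy name are illustrative; the `quorum_queues` apply-to target requires a recent release, older releases use `queues`):

```json
{
  "policies": [
    {
      "vhost": "prod",
      "name": "delivery-limit",
      "pattern": ".*",
      "apply-to": "quorum_queues",
      "priority": 0,
      "definition": { "delivery-limit": 5 }
    }
  ]
}
```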
Performance Tuning¶
| Tunable | Effect |
|---|---|
| `vm_memory_high_watermark.relative` | Threshold above which producers are blocked. Default 0.4. |
| `disk_free_limit.relative` | Producers blocked when free disk falls below this fraction of total RAM; 1.0 or higher is a common recommendation. |
| `channel_max` | Max channels per connection. Default 2047. |
| `cluster_partition_handling` | `pause_minority` (recommended), `autoheal`, or `ignore`. Becomes simpler under Khepri. |
| `default_consumer_prefetch` | Per-consumer prefetch applied when the client does not set one explicitly. |
| `loopback_users` | Restricts the `guest` user to localhost (the default); keep this. |
| `tcp_listen_options` | Adjust `nodelay`, `linger`, and send/receive buffers. |
| `collect_statistics_interval` | Lower it if the management UI lags behind; raise it to reduce CPU on high-cardinality fleets. |
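Several of these land in `rabbitmq.conf`; an illustrative combination (starting points, not recommendations):

```ini
# rabbitmq.conf — values are illustrative, tune against your own profiling
vm_memory_high_watermark.relative = 0.4
disk_free_limit.relative = 1.0
channel_max = 2047
cluster_partition_handling = pause_minority
tcp_listen_options.nodelay = true
collect_statistics_interval = 10000
loopback_users.guest = true
```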
Troubleshooting¶
Memory alarm — producers blocked¶
Symptom: publishers report connection.blocked; mgmt UI shows red node.
Causes: queue backlog, large in-memory classic queues, big mgmt-UI history retention.
Fixes:
- rabbitmq-diagnostics memory_breakdown to identify the dominant consumer.
- Drain a queue or migrate it to quorum (lazy disk semantics).
- Reduce management-plugin metrics retention, or raise `collect_statistics_interval`.
- Raise the watermark only as a temporary measure.
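To reason about when the alarm fires, the relative watermark is simply a fraction of the RAM the node detects; a quick sketch of where 0.4 lands on an 8 GiB node:

```shell
# 0.4 relative watermark on an 8 GiB node, in integer MiB arithmetic
total_mib=8192
watermark_mib=$(( total_mib * 4 / 10 ))
echo "memory alarm above ${watermark_mib} MiB"   # 3276 MiB
```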
Disk free alarm¶
Trim a stream's max_age or max_segment_size, or expand the volume.
Quorum queue stuck (no leader)¶
Symptom: rabbitmq-queues quorum_status NAME shows no leader.
Cause: insufficient nodes for Raft quorum (e.g. 1 of 3 reachable).
Fix: restore network connectivity, or, in last-resort recovery, rabbitmq-queues delete_member and re-add a fresh node. Avoid force_reset unless you're aware of the data-loss implications.
Slow consumer back-pressure¶
Use rabbitmqctl list_queues messages messages_ready messages_unacknowledged consumers consumer_capacity to find queues with low capacity. Low consumer capacity + high unacked = slow consumer.
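A sketch of filtering that output mechanically; the sample lines below are fabricated, and the column order assumes `name` was added as the first column of the `list_queues` call:

```shell
# Columns: name  messages  ready  unacked  consumers  consumer_capacity (0..1)
sample='orders 12000 11000 1000 4 0.12
events 300 280 20 2 0.95'

# Flag queues with low capacity and a large unacked backlog
echo "$sample" | awk '$6 < 0.5 && $4 > 500 { print $1, "looks like a slow consumer" }'
```

On this sample only `orders` is flagged: its capacity is 0.12 with 1000 unacked messages.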
Khepri membership drift¶
rabbitmqctl status # check Khepri members
rabbitmqctl forget_cluster_node NODE # remove a permanently dead node
MQTT 5 / WebSocket clients disconnecting¶
- Check that the `rabbitmq_mqtt` plugin version matches the server.
- Verify `mqtt.listeners.tcp` and `mqtt.listeners.ssl` are enabled.
- For mass disconnects after an upgrade, check the client library's MQTT 5 vs 3.1.1 default; RabbitMQ supports both, but configuration is per-listener.
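A sketch of the listener keys involved (ports shown are the protocol defaults; WebSocket MQTT additionally requires the `rabbitmq_web_mqtt` plugin):

```ini
# rabbitmq.conf
mqtt.listeners.tcp.default = 1883
mqtt.listeners.ssl.default = 8883
# WebSocket transport (rabbitmq_web_mqtt plugin)
web_mqtt.tcp.port = 15675
```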
Cost Analysis¶
| Cost | Driver |
|---|---|
| Compute | Erlang VM is moderate; not negligible at idle. |
| Storage | Quorum queue WAL fsync = burst writes; provision NVMe. |
| Memory | Classic queues hold messages until paged out; quorum queues spool to disk by default. |
| Network egress | Federation/Shovel cross-region links carry duplicate traffic. |
| Tanzu RabbitMQ | Per-core licensing; sometimes cheaper than running ops yourself. |
| CloudAMQP | Per-instance pricing scales linearly with throughput class. |
Commands & Recipes¶
Bootstrap & cluster¶
# On node 1
rabbitmqctl status
# On node 2 — join node 1's cluster
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app
Vhost & user setup¶
rabbitmqctl add_vhost prod --default-queue-type quorum
rabbitmqctl add_user app 'change-me-now'
rabbitmqctl set_user_tags app monitoring
rabbitmqctl set_permissions -p prod app '.*' '.*' '.*'
# OAuth 2.0 plugin (replace local users)
rabbitmq-plugins enable rabbitmq_auth_backend_oauth2
Declare a quorum queue + binding¶
rabbitmqadmin declare queue name=orders queue_type=quorum durable=true \
arguments='{"x-delivery-limit":5,"x-max-length":1000000}'
rabbitmqadmin declare exchange name=orders.x type=topic durable=true
rabbitmqadmin declare binding source=orders.x destination=orders routing_key=orders.created.*
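The same queue can be declared over the management HTTP API mentioned at the top of this page; a sketch assuming the `app` user and the default API port, with a fallback echo so the snippet degrades gracefully when no broker is reachable:

```shell
# PUT /api/queues/{vhost}/{name} declares a queue idempotently.
API=http://localhost:15672/api
curl -fsS -u app:change-me-now -X PUT "$API/queues/prod/orders" \
  -H 'content-type: application/json' \
  -d '{"durable":true,"arguments":{"x-queue-type":"quorum","x-delivery-limit":5}}' \
  || echo "management API not reachable at $API"
```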
Declare a stream¶
rabbitmqadmin declare queue name=events queue_type=stream durable=true \
arguments='{"x-max-length-bytes":50000000000,"x-stream-max-segment-size-bytes":500000000}'
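Sanity-checking those byte values: the declaration above retains roughly 50 GB in 500 MB segments, and since streams discard whole segments, retention granularity is one segment:

```shell
max_bytes=50000000000   # x-max-length-bytes (~50 GB)
seg_bytes=500000000     # x-stream-max-segment-size-bytes (~500 MB)
echo "$(( max_bytes / seg_bytes )) segments retained at steady state"   # 100
```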
Federation upstream¶
rabbitmqctl set_parameter federation-upstream us-prod \
'{"uri":"amqps://app:[email protected]:5671","trust-user-id":true}'
rabbitmqctl set_policy federate-orders "^orders\." \
'{"federation-upstream-set":"all"}' --apply-to exchanges
Diagnostics¶
rabbitmq-diagnostics status
rabbitmq-diagnostics memory_breakdown
rabbitmq-diagnostics check_alarms
rabbitmq-diagnostics check_running
rabbitmq-diagnostics observer # Erlang interactive observer
rabbitmqctl list_queues name type messages messages_ready consumers
rabbitmqctl list_connections user host channels state
rabbitmq-queues quorum_status orders
rabbitmq-queues stream_status events
Prometheus + Grafana¶
Scrape the `rabbitmq_prometheus` endpoint (metrics on `:15692/metrics`) and import the official RabbitMQ Grafana dashboards, e.g. RabbitMQ-Overview.
perf-test¶
docker run -it --rm pivotalrabbitmq/perf-test:latest \
--uri amqp://app:pwd@host:5672 \
--producers 10 --consumers 10 \
--rate 10000 --confirm 100 \
--queue-pattern 'q-%d' --queue-pattern-from 1 --queue-pattern-to 50 \
--quorum-queue
Upgrade Strategy¶
- Rolling upgrade within a minor (4.2.0 → 4.2.5): update node by node, checking `rabbitmq-diagnostics check_running` before moving on.
- Mixed-version tolerance is N → N+1 minor only; avoid running 4.0 next to 4.2 in the same cluster.
- Khepri migration: when migrating from Mnesia, run `rabbitmqctl enable_feature_flag khepri_db` only after all nodes are on 4.0+. Once enabled, downgrading requires backup and restore.
- Plugin compatibility: check each plugin's release notes before upgrading.
Cross-references¶
- messaging/rabbitmq/architecture — for understanding queue types you operate.
- messaging/rabbitmq/security — for OAuth 2.0, mTLS, and threat model.
- messaging/index — comparisons with Kafka / NATS / Redpanda / Pulsar.