Operations¶
Production deployment, tuning, troubleshooting, and a Commands & Recipes section for rabbitmqctl, rabbitmq-diagnostics, and the management HTTP API.
Deployment Patterns¶
Three-node cluster (recommended baseline)¶
Three RabbitMQ nodes on separate hosts/AZs, joined into a cluster. The Khepri metadata store runs its Raft group across these three nodes, and quorum queues replicate across all three.
Five-node cluster¶
For higher concurrency or when running quorum queues at scale, five nodes give more headroom and tolerate two failures, at the cost of a slightly larger write quorum (three of five replicas must fsync each commit).
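The fault-tolerance claims follow directly from Raft's majority rule; a quick sketch of the arithmetic:

```shell
# Raft commits once a majority of members acknowledge; the rest can fail.
for n in 3 5 7; do
  majority=$(( n / 2 + 1 ))
  tolerated=$(( n - majority ))
  echo "cluster of $n: majority=$majority, tolerates $tolerated node failures"
done
```

Note that even-sized clusters buy nothing: a cluster of 4 still tolerates only one failure, which is why 3 and 5 are the recommended sizes.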
Kubernetes (Cluster Operator)¶
Use the RabbitMQ Cluster Operator to declare clusters via CRDs. Pair with the Messaging Topology Operator to declare exchanges/queues/users in YAML.
```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: prod
  namespace: rabbit
spec:
  replicas: 3
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
  persistence:
    storage: 200Gi
    storageClassName: ssd
  rabbitmq:
    additionalConfig: |
      cluster_partition_handling = pause_minority
      vm_memory_high_watermark.relative = 0.4
      default_queue_type = quorum
```
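With the Messaging Topology Operator installed alongside, queues can be declared the same way; a sketch of a quorum queue targeting the cluster above (queue name and vhost are illustrative):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: orders
  namespace: rabbit
spec:
  name: orders        # queue name inside RabbitMQ
  vhost: prod
  type: quorum
  durable: true
  rabbitmqClusterReference:
    name: prod        # must match the RabbitmqCluster above
```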
Sizing¶
| Resource | Guidance |
|---|---|
| CPU | 4–8 vCPUs per node typical; quorum queues are CPU-intensive on Raft |
| Memory | 4–8 GB per node minimum; raise watermark only after profiling |
| Disk | NVMe — fsync latency dominates quorum queue throughput |
| Network | 10 GbE+ for replication; gigabit is fine for low-volume queues |
| File descriptors | Bump nofile to ~64k+ for many connections |
| Erlang procs | +P flag — defaults are usually OK |
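When RabbitMQ runs under systemd, the `nofile` bump from the table is typically applied with a drop-in unit override; a minimal sketch:

```ini
# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65536
```

Run `systemctl daemon-reload` and restart the node, then confirm the new limit in the `rabbitmq-diagnostics status` output.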
Best Practices¶
- Default new vhosts to quorum queues (`default_queue_type = quorum`).
- Set `x-delivery-limit` on quorum queues to prevent poison-message loops.
- Use prefetch (`basic.qos`) generously; a prefetch of 10–250 per consumer balances throughput and fairness.
- Use publisher confirms (`channel.confirm_select` / `confirmCallback`) for any meaningful durability.
- Prefer streams over fanout when the consumer count is high and replay is needed.
- Use federation, not clustering, across a WAN; clustering over high-latency links is unsupported.
- One vhost per environment/tenant; don't share namespaces.
- Enable the Prometheus plugin (built-in in 4.x): `rabbitmq-plugins enable rabbitmq_prometheus`.
- One connection per app, channels per thread; one connection per request is an anti-pattern.
- Give quorum queues at least three members (`x-quorum-initial-group-size: 3`).
- Run the management UI on a dedicated network; don't expose `:15672` to the internet.
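The `x-delivery-limit` recommendation can also be enforced fleet-wide with a policy rather than per-declaration arguments; a sketch as it would appear in an exported definitions file (vhost and policy name are illustrative; the `quorum_queues` apply-to target requires a recent release, older releases use `queues`):

```json
{
  "policies": [
    {
      "vhost": "prod",
      "name": "delivery-limit",
      "pattern": ".*",
      "apply-to": "quorum_queues",
      "priority": 0,
      "definition": { "delivery-limit": 5 }
    }
  ]
}
```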
Performance Tuning¶
| Tunable | Effect |
|---|---|
| `vm_memory_high_watermark.relative` | Threshold above which producers are blocked. Default 0.4. |
| `disk_free_limit.relative` | Producers blocked when free disk falls below this fraction of total RAM; 1.0 or higher is a common recommendation. |
| `channel_max` | Max channels per connection. Default 2047. |
| `cluster_partition_handling` | `pause_minority` (recommended), `autoheal`, or `ignore`. Becomes simpler under Khepri. |
| `default_consumer_prefetch` | Per-consumer prefetch applied when the client does not set one explicitly. |
| `loopback_users` | Restricts the `guest` user to localhost (the default); keep this. |
| `tcp_listen_options` | Adjust `nodelay`, `linger`, and send/receive buffers. |
| `collect_statistics_interval` | Lower it if the management UI lags behind; raise it to reduce CPU on high-cardinality fleets. |
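Several of these land in `rabbitmq.conf`; an illustrative combination (starting points, not recommendations):

```ini
# rabbitmq.conf — values are illustrative, tune against your own profiling
vm_memory_high_watermark.relative = 0.4
disk_free_limit.relative = 1.0
channel_max = 2047
cluster_partition_handling = pause_minority
tcp_listen_options.nodelay = true
collect_statistics_interval = 10000
loopback_users.guest = true
```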
Troubleshooting¶
Memory alarm — producers blocked¶
Symptom: publishers report connection.blocked; mgmt UI shows red node.
Causes: queue backlog, large in-memory classic queues, big mgmt-UI history retention.
Fixes:
- rabbitmq-diagnostics memory_breakdown to identify the dominant consumer.
- Drain a queue or migrate it to quorum (lazy disk semantics).
- Reduce management-plugin metrics retention, or raise `collect_statistics_interval`.
- Raise the watermark only as a temporary measure.
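To reason about when the alarm fires, the relative watermark is simply a fraction of the RAM the node detects; a quick sketch of where 0.4 lands on an 8 GiB node:

```shell
# 0.4 relative watermark on an 8 GiB node, in integer MiB arithmetic
total_mib=8192
watermark_mib=$(( total_mib * 4 / 10 ))
echo "memory alarm above ${watermark_mib} MiB"   # 3276 MiB
```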
Disk free alarm¶
Trim a stream's max_age or max_segment_size, or expand the volume.
Quorum queue stuck (no leader)¶
Symptom: rabbitmq-queues quorum_status NAME shows no leader.
Cause: insufficient nodes for Raft quorum (e.g. 1 of 3 reachable).
Fix: restore network connectivity, or, in last-resort recovery, rabbitmq-queues delete_member and re-add a fresh node. Avoid force_reset unless you're aware of the data-loss implications.
Slow consumer back-pressure¶
Use rabbitmqctl list_queues messages messages_ready messages_unacknowledged consumers consumer_capacity to find queues with low capacity. Low consumer capacity + high unacked = slow consumer.
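A sketch of filtering that output mechanically; the sample lines below are fabricated, and the column order assumes `name` was added as the first column of the `list_queues` call:

```shell
# Columns: name  messages  ready  unacked  consumers  consumer_capacity (0..1)
sample='orders 12000 11000 1000 4 0.12
events 300 280 20 2 0.95'

# Flag queues with low capacity and a large unacked backlog
echo "$sample" | awk '$6 < 0.5 && $4 > 500 { print $1, "looks like a slow consumer" }'
```

On this sample only `orders` is flagged: its capacity is 0.12 with 1000 unacked messages.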
Khepri membership drift¶
rabbitmqctl status # check Khepri members
rabbitmqctl forget_cluster_node NODE # remove a permanently dead node
MQTT 5 / WebSocket clients disconnecting¶
- Check that the `rabbitmq_mqtt` plugin version matches the server.
- Verify `mqtt.listeners.tcp` and `mqtt.listeners.ssl` are enabled.
- For mass disconnects after an upgrade, check the client library's MQTT 5 vs 3.1.1 default; RabbitMQ supports both, but configuration is per-listener.
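A sketch of the listener keys involved (ports shown are the protocol defaults; WebSocket MQTT additionally requires the `rabbitmq_web_mqtt` plugin):

```ini
# rabbitmq.conf
mqtt.listeners.tcp.default = 1883
mqtt.listeners.ssl.default = 8883
# WebSocket transport (rabbitmq_web_mqtt plugin)
web_mqtt.tcp.port = 15675
```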
Cost Analysis¶
| Cost | Driver |
|---|---|
| Compute | Erlang VM is moderate; not negligible at idle. |
| Storage | Quorum queue WAL fsync = burst writes; provision NVMe. |
| Memory | Classic queues hold messages until paged out; quorum queues spool to disk by default. |
| Network egress | Federation/Shovel cross-region links carry duplicate traffic. |
| Tanzu RabbitMQ | Per-core licensing; sometimes cheaper than running ops yourself. |
| CloudAMQP | Per-instance pricing scales linearly with throughput class. |
Commands & Recipes¶
Bootstrap & cluster¶
# On node 1
rabbitmqctl status
# On node 2 — join node 1's cluster
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app
Vhost & user setup¶
rabbitmqctl add_vhost prod --default-queue-type quorum
rabbitmqctl add_user app 'change-me-now'
rabbitmqctl set_user_tags app monitoring
rabbitmqctl set_permissions -p prod app '.*' '.*' '.*'
# OAuth 2.0 plugin (replace local users)
rabbitmq-plugins enable rabbitmq_auth_backend_oauth2
Declare a quorum queue + binding¶
rabbitmqadmin declare queue name=orders queue_type=quorum durable=true \
arguments='{"x-delivery-limit":5,"x-max-length":1000000}'
rabbitmqadmin declare exchange name=orders.x type=topic durable=true
rabbitmqadmin declare binding source=orders.x destination=orders routing_key=orders.created.*
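The same queue can be declared over the management HTTP API mentioned at the top of this page; a sketch assuming the `app` user and the default API port, with a fallback echo so the snippet degrades gracefully when no broker is reachable:

```shell
# PUT /api/queues/{vhost}/{name} declares a queue idempotently.
API=http://localhost:15672/api
curl -fsS -u app:change-me-now -X PUT "$API/queues/prod/orders" \
  -H 'content-type: application/json' \
  -d '{"durable":true,"arguments":{"x-queue-type":"quorum","x-delivery-limit":5}}' \
  || echo "management API not reachable at $API"
```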
Declare a stream¶
rabbitmqadmin declare queue name=events queue_type=stream durable=true \
arguments='{"x-max-length-bytes":50000000000,"x-stream-max-segment-size-bytes":500000000}'
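Sanity-checking those byte values: the declaration above retains roughly 50 GB in 500 MB segments, and since streams discard whole segments, retention granularity is one segment:

```shell
max_bytes=50000000000   # x-max-length-bytes (~50 GB)
seg_bytes=500000000     # x-stream-max-segment-size-bytes (~500 MB)
echo "$(( max_bytes / seg_bytes )) segments retained at steady state"   # 100
```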
Federation upstream¶
rabbitmqctl set_parameter federation-upstream us-prod \
'{"uri":"amqps://app:[email protected]:5671","trust-user-id":true}'
rabbitmqctl set_policy federate-orders "^orders\." \
'{"federation-upstream-set":"all"}' --apply-to exchanges
Diagnostics¶
rabbitmq-diagnostics status
rabbitmq-diagnostics memory_breakdown
rabbitmq-diagnostics check_alarms
rabbitmq-diagnostics check_running
rabbitmq-diagnostics observer # Erlang interactive observer
rabbitmqctl list_queues name type messages messages_ready consumers
rabbitmqctl list_connections user host channels state
rabbitmq-queues quorum_status orders
rabbitmq-queues stream_status events
Prometheus + Grafana¶
Scrape the `rabbitmq_prometheus` endpoint (metrics on `:15692/metrics`) and import the official RabbitMQ Grafana dashboards, e.g. RabbitMQ-Overview.
perf-test¶
docker run -it --rm pivotalrabbitmq/perf-test:latest \
--uri amqp://app:pwd@host:5672 \
--producers 10 --consumers 10 \
--rate 10000 --confirm 100 \
--queue-pattern 'q-%d' --queue-pattern-from 1 --queue-pattern-to 50 \
--quorum-queue
Upgrade Strategy¶
- Rolling upgrade within a minor (4.2.0 → 4.2.5): update node by node, checking `rabbitmq-diagnostics check_running` before moving on.
- Mixed-version tolerance is N → N+1 minor only; avoid running 4.0 next to 4.2 in the same cluster.
- Khepri migration: when migrating from Mnesia, run `rabbitmqctl enable_feature_flag khepri_db` only after all nodes are on 4.0+. Once enabled, downgrading requires backup and restore.
- Plugin compatibility: check each plugin's release notes before upgrading.
Cross-references¶
- messaging/rabbitmq/architecture — for understanding queue types you operate.
- messaging/rabbitmq/security — for OAuth 2.0, mTLS, and threat model.
- messaging/index — comparisons with Kafka / NATS / Redpanda / Pulsar.