Portal:Cloud VPS/Admin/Runbooks/RabbitmqNetworkPartition
Error / Incident
This alert fires when the rabbitmq servers no longer agree on the state of the cluster. This happens occasionally, for reasons that are not yet understood: the servers can still talk to their clients but not to each other. While the cluster is partitioned, OpenStack services see a lot of RPC and other messaging timeouts.
A state of -1 means that the metric is not being collected. In that case there may or may not be an actual network partition, so check the cluster directly as described below.
Debugging
This alert is based on the output of rabbitmqctl cluster_status. Here's what it looks like when everything is healthy:
andrew@cloudrabbit1003:~$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudrabbit1003 ...
Basics
Cluster name: rabbit@cloudrabbit1003.wikimedia.org
Disk Nodes
rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003
Running Nodes
rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003
Versions
rabbit@cloudrabbit1001: RabbitMQ 3.9.13 on Erlang 24.2.1
rabbit@cloudrabbit1002: RabbitMQ 3.9.13 on Erlang 24.2.1
rabbit@cloudrabbit1003: RabbitMQ 3.9.13 on Erlang 24.2.1
Maintenance status
Node: rabbit@cloudrabbit1001, status: not under maintenance
Node: rabbit@cloudrabbit1002, status: not under maintenance
Node: rabbit@cloudrabbit1003, status: not under maintenance
Alarms
(none)
Network Partitions
(none)
Listeners
Node: rabbit@cloudrabbit1001, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1001, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1001, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1001, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1001, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1002, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1002, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1002, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1002, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1002, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1003, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1003, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1003, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1003, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1003, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Feature flags
Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled
Note that 'Network Partitions' shows as '(none)'. In case of a partition, that section will list the partitioned servers. Running rabbitmqctl cluster_status on all three nodes should make it obvious which node has fallen out of consensus.
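Since the full output is long, it can help to pull out only the 'Network Partitions' section when comparing nodes. The following is a minimal sketch, not part of the alert tooling: the function names partition_section and no_partition are made up for illustration, and it assumes the plain-text layout shown above, where the section is followed by the 'Listeners' heading.

```shell
#!/bin/sh
# Print the body of the "Network Partitions" section from
# `rabbitmqctl cluster_status` text supplied on stdin.
# Assumption: the section is followed by the "Listeners" heading,
# as in the sample output above.
partition_section() {
    awk '
        /Network Partitions/ { in_section = 1; next }
        /Listeners/          { in_section = 0 }
        in_section && NF     { print }
    '
}

# Succeed (exit 0) only when the section body is exactly "(none)".
# A missing section counts as a failure, to stay on the safe side.
no_partition() {
    [ "$(partition_section)" = "(none)" ]
}
```

On each of the three nodes, something like `sudo rabbitmqctl cluster_status | partition_section` would then print either '(none)' or whatever that node reports as partitioned.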
Most often this can be resolved on the failing host by restarting rabbit:
andrew@cloudcontrol2001-dev:~$ sudo rabbitmqctl stop_app
Stopping rabbit application on node rabbit@cloudcontrol2001-dev ...
andrew@cloudcontrol2001-dev:~$ sudo rabbitmqctl start_app
Starting node rabbit@cloudcontrol2001-dev ...
If that does not resolve the issue, it might be necessary to reset the failing node, or to reset the entire cluster.
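This runbook does not spell out the reset procedure. As a rough sketch only, one common way to reset a single node and rejoin it to the cluster uses the standard rabbitmqctl subcommands stop_app, reset, join_cluster and start_app. The function name reset_and_rejoin and the RABBITMQCTL override are assumptions made for this example, and the peer must be a node that is still healthy. Be aware that reset wipes the local node's state and cluster membership. Resetting the entire cluster is not covered here.

```shell
#!/bin/sh
# Command used to invoke rabbitmqctl; override for a dry run,
# e.g. RABBITMQCTL="echo rabbitmqctl".
RABBITMQCTL="${RABBITMQCTL:-sudo rabbitmqctl}"

# Reset the local node and rejoin it to the cluster via a healthy peer.
# $1: node name of a healthy cluster member, e.g. rabbit@cloudrabbit1001
# WARNING: `reset` wipes this node's local state and cluster membership.
reset_and_rejoin() {
    peer="$1"
    if [ -z "$peer" ]; then
        echo "usage: reset_and_rejoin rabbit@<healthy-node>" >&2
        return 2
    fi
    # Stop at the first failing step so a half-reset node is noticed.
    $RABBITMQCTL stop_app &&
    $RABBITMQCTL reset &&
    $RABBITMQCTL join_cluster "$peer" &&
    $RABBITMQCTL start_app
}
```

Running it first with `RABBITMQCTL="echo rabbitmqctl"` prints the four commands without executing them, which is a cheap way to double-check the peer name before touching the node.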