What is the reason behind activating global overload protection?

Since last week we have had clients complaining that they see disconnects every 30 minutes. When we look at the logs on one of our MQTT nodes, we can see:

2023-09-25 14:09:05,259 INFO  - Limiting the connect-rate of listener 'tls-tcp-listener-8883' to '68' CONNECT/s, because 'MEDIUM' global overload protection was activated.
2023-09-25 14:09:10,061 INFO  - Limiting the connect-rate of listener 'tls-tcp-listener-8883' to '35' CONNECT/s, because 'MEDIUM_HIGH' global overload protection was activated.
2023-09-25 14:09:14,835 INFO  - 'HIGH' local overload protection activated because of '67114' total running tasks (SingleWriter: 67093, SingleWriterCallbacks: 21, Extension: 0, Persistence: 0, Netty: 0).
2023-09-25 14:09:14,836 INFO  - Limiting the connect-rate of listener 'tls-tcp-listener-8883' to '19' CONNECT/s, because 'HIGH' global overload protection was activated.
2023-09-25 14:09:19,659 INFO  - 'HIGHER' local overload protection activated because of '78075' total running tasks (SingleWriter: 78022, SingleWriterCallbacks: 53, Extension: 0, Persistence: 0, Netty: 0).
2023-09-25 14:09:19,663 INFO  - Limiting the connect-rate of listener 'tls-tcp-listener-8883' to '11' CONNECT/s, because 'HIGHER' global overload protection was activated.
2023-09-25 14:09:24,632 INFO  - 'HIGHEST' local overload protection activated because of '89314' total running tasks (SingleWriter: 89304, SingleWriterCallbacks: 10, Extension: 0, Persistence: 0, Netty: 0).
2023-09-25 14:09:24,632 INFO  - Limiting the connect-rate of listener 'tls-tcp-listener-8883' to '7' CONNECT/s, because 'HIGHEST' global overload protection was activated.
2023-09-25 14:09:28,059 INFO  - Local overload protection deactivated
2023-09-25 14:10:09,331 INFO  - Stopped limiting the connect-rate of listener 'tls-tcp-listener-8883', because the global overload protection was deactivated.

I’m wondering what could cause this issue.

We suspect it can occur if our Queue Size increases drastically. Can the Shared Subscription Queue Size in that case also start to go up as a result of the Queued Messages increasing drastically?

How, if possible, can we adjust our MQTT configuration to prevent this from happening in the future? Can we set some limitations other than what we currently have in the following settings?

<lifetime>604800</lifetime> <!-- 7 days -->
</client-event-history>

<queued-messages>
    <max-queue-size>259200000</max-queue-size> <!-- 1/s x 3600 x 24 x 3 x 1000 -->
    <strategy>discard-oldest</strategy>
</queued-messages>
<session-expiry>
    <max-interval>259200</max-interval> <!-- 3600 x 24 x 3 (3 days) -->
</session-expiry>
<packets>
    <max-packet-size>262144</max-packet-size> <!-- 256KB -->
</packets>

<key>supervision.global.tasks.maximum</key>
<value>100000</value>

<key>supervision.global.tasks.minimum</key>
<value>1000</value>

<key>initial.client-credits.per-tick</key>
<value>3000</value>

<key>initial.client-credits.publish</key>
<value>75000</value>

To sum up:
1. Can you let us know why overload protection was activated?
2. Is there a direct correlation between Queued Messages and Shared Subscription Queue Size, i.e., when the former starts to grow rapidly, does the latter also start to grow?
3. Can we adjust our config to prevent this from happening in the future, other than the obvious step of decreasing the max-queue-size?

Best regards
Ash

Hello @icc,

Thank you for reaching out - we’d be happy to clarify! I’ll answer each question in order. To start, we also have a portion of our documentation dedicated to cluster overload protection and exactly how it functions, available here.

1.) As noted in the logs, this was activated due to 'HIGH' local overload protection activated because of '67114' total running tasks. This typically indicates that the broker received more tasks - in this case SingleWriter tasks - than it was able to process in time. This can have a number of causes, but it is usually the result of message throughput reaching a rate the broker cannot keep up with, more clients connecting than the provided resources can support, or, in a clustered deployment, a node joining or leaving the cluster.
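If connection bursts or raw inbound throughput turn out to be contributing factors, broker-level restrictions can help smooth the load. The following is only a rough sketch - please verify the element names, units, and defaults against the configuration reference for your HiveMQ version, and treat the values as placeholders to be sized for your deployment:

<restrictions>
    <!-- Placeholder: upper bound on concurrent client connections (-1 = unlimited) -->
    <max-connections>50000</max-connections>
    <!-- Placeholder: throttle inbound traffic (bytes per second, 0 = unlimited) -->
    <incoming-bandwidth-throttling>10485760</incoming-bandwidth-throttling>
</restrictions>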

2.) Queued messages are, in essence, messages that are due to be delivered to clients but that the clients have not yet been able to receive. This can be because the subscribing client is offline, or simply because more messages are arriving than the client can process at one time. The same can occur for clients using shared subscriptions, as these shared subscription messages, depending on their QoS level, can also be queued for the subscribing clients. While these two metrics show two distinct types of queued messages, when messages begin to queue for any clients - using shared subscriptions or otherwise - the queued message size will grow. If the shared subscription queue size is growing as well, it is likely that those clients are also queueing messages.
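If queue growth is driven by subscribers (shared or otherwise) being offline or falling behind for long periods, one option is to bound how long queued messages are retained at all. As a rough sketch only - the exact element names should be checked against the MQTT configuration reference for your HiveMQ version, and the interval below is purely a placeholder:

<mqtt>
    <message-expiry>
        <!-- Placeholder: expire queued messages after 1 hour (value in seconds) -->
        <max-interval>3600</max-interval>
    </message-expiry>
</mqtt>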

3.) In order to prevent this in the future, we first need a better understanding of what specifically caused this SingleWriter task spike - whether new clients were joining, the message load increased, clients fell offline during a period of high volume, or the broker itself can no longer keep up with the volume and additional resources need to be allocated. Once the cause has been identified, configuration options such as modifying how overload protection functions or reducing the queue size per client can be adjusted to create a stable environment.
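As one example of the "queue size per client" adjustment, a first step could be to lower the per-client limit from the excerpt you shared. This is a sketch with a placeholder value only - the right number depends on your expected offline periods and message rates:

<mqtt>
    <queued-messages>
        <!-- Placeholder: cap each client's queue well below the current 259200000 -->
        <max-queue-size>100000</max-queue-size>
        <strategy>discard-oldest</strategy>
    </queued-messages>
</mqtt>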

Best,
Aaron from the HiveMQ Team