we run HiveMQ CE 2019.1 in a heavy load scenario and frequently receive these errors in the logfile:
2019-12-12 14:01:48,560 - Outgoing publish message was dropped. Receiving shared subscription group: xxx, topic: xxx, qos: 0, reason: The QoS 0 memory limit exceeded, size: 8,044,675,132 bytes, max: 8,044,675,072 bytes.
Is there a way to increase the QoS 0 memory limit? I couldn’t find anything in the documentation.
Any help is appreciated.
The maximum amount of memory that can be used for QoS=0 queuing is set to 1/4 of the available memory. You can increase it by increasing the memory of your machine.
Please note that HiveMQ only queues QoS=0 messages in memory as a courtesy, for the case that the receiver’s TCP socket is not writable. According to the MQTT specification it would also be correct to simply drop the messages without any queuing.
It seems to me that the root cause of the dropped messages you observe is subscribers that are not consuming messages as quickly as they are being published to your broker.
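To put that reasoning into numbers, here is a minimal Python sketch of how long an in-memory QoS 0 queue lasts once a subscriber falls behind. The function name and all rates and sizes are hypothetical illustration values, not HiveMQ internals, and any per-message overhead inside the broker is ignored.

```python
def seconds_until_full(limit_bytes, publish_rate, consume_rate, bytes_per_msg):
    """Estimate how long a queue takes to grow to limit_bytes.

    Rates are in messages per second. Returns None if the consumer
    keeps up, because the backlog then never grows.
    """
    backlog_rate = publish_rate - consume_rate  # messages/sec left behind
    if backlog_rate <= 0:
        return None
    return limit_bytes / (backlog_rate * bytes_per_msg)


# Hypothetical example: 1 GiB limit, subscriber 1,000 msg/sec too slow,
# 1 KiB queued per message.
print(seconds_until_full(2**30, 5000, 4000, 1024))  # 1048.576
```

The point of the sketch is that a larger limit only delays the drops: as long as the consume rate stays below the publish rate, the queue fills eventually.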
I hope this helps.
Michael from the HiveMQ Team
thanks for your kind and quick response!
The host (it’s a VM) has 128GB of RAM of which only 50% is used (including HiveMQ). So having 1/4th of that for QoS 0 message buffer would already be an improvement!
Can you give us a hint as to what could cause that 8 GB cap? Is the “available memory”, and hence the QoS 0 memory limit, determined only once during startup of the broker or continuously?
Here is top’s current memory line from the host:
KiB Mem : 13203732+total, 18012776 free, 66245228 used, 47779316 buff/cache
We had the same idea, that there might be a bottleneck with the consuming client.
Especially since in our scenario 5,000 msg/sec are published (via 20,000 permanent TCP connections), but only one subscriber (the data sink) is consuming them.
In a previous setup with a mosquitto broker, the subscribing client showed that it could handle around 4,000 msg/sec without problems. So if a single client wasn’t consuming fast enough, running a few of them instead should be sufficient.
Therefore, we changed to shared subscription and continued increasing the number of clients in the subscription group from 1 to 4 to 8 and finally to 12 (yesterday evening).
However, we still receive the same “memory exceeded” errors frequently.
Now I am wondering why this kind of “load balancing” doesn’t take effect. Is there something to consider with shared subscriptions that might have a negative effect on memory consumption?
Does it make sense to further increase the number of clients in the subscription group, or do you see a fundamental problem with this approach?
Your help is much appreciated!
I wasn’t precise here: by “available memory” I meant the heap you set for the JVM that runs HiveMQ (the -Xmx option). The maximum size of the QoS=0 queue is 1/4 of this heap.
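As a quick sanity check of the 1/4 rule, the limit printed in the log line at the top of this thread can be turned back into a heap size. This Python sketch is only arithmetic on the logged value; that the result equals the heap actually configured on that broker is an assumption, and the names used are illustrative.

```python
HEAP_DIVISOR = 4  # QoS 0 queue limit = JVM heap / 4, per the reply above


def qos0_limit(heap_bytes):
    """Maximum QoS 0 queue size for a given JVM heap size in bytes."""
    return heap_bytes // HEAP_DIVISOR


logged_limit = 8_044_675_072  # "max" value in bytes from the log line

implied_heap = logged_limit * HEAP_DIVISOR
print(implied_heap)           # 32178700288 bytes
print(implied_heap / 2**30)   # 29.96875 (GiB)

# An 8 GiB heap gives a 2 GiB QoS 0 limit.
print(qos0_limit(8 * 2**30) / 2**30)  # 2.0
```

So the 8 GB cap is consistent with a JVM heap of roughly 30 GB, not with the 128 GB of RAM in the host.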
About your consumption problem, I did some tests myself with the scenario you described and can’t reproduce your result.
Here is one test with one subscriber (I used the HiveMQ client) that can consume 5k msg/sec (QoS 0, 100-byte payload, messages discarded after reception):
And here is a test where I increased the payload to 1 KB and used five shared subscribers instead of one:
It would be nice to know how you create shared subscriptions, because my assumption is currently that the problem lies there.
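For reference, a shared subscription is requested through the topic filter itself, which has the form `$share/<group>/<topic filter>`. The following Python sketch builds and recognizes such filters; the helper names, the group name `sinks` and the topic `metrics/#` are hypothetical examples, not taken from the setup discussed here.

```python
def shared_filter(group, topic_filter):
    """Build a shared-subscription topic filter: $share/<group>/<filter>."""
    # The group name must be non-empty and must not contain '/', '+' or '#'.
    if not group or any(c in group for c in "/+#"):
        raise ValueError("invalid group name: %r" % group)
    if not topic_filter:
        raise ValueError("topic filter must not be empty")
    return "$share/{}/{}".format(group, topic_filter)


def parse_shared_filter(subscription):
    """Return (group, topic_filter), or None for a non-shared subscription."""
    parts = subscription.split("/", 2)
    if len(parts) == 3 and parts[0] == "$share" and parts[1] and parts[2]:
        return parts[1], parts[2]
    return None


print(shared_filter("sinks", "metrics/#"))         # $share/sinks/metrics/#
print(parse_shared_filter("$share/sinks/metrics/#"))  # ('sinks', 'metrics/#')
print(parse_shared_filter("metrics/#"))            # None
```

Every client that subscribes with the same group name and the same filter joins one group, and each matching message is delivered to only one member of that group.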
What you could do is:
- download the HiveMQ Enterprise version
- start it
- connect your subscribers as you would do to the HiveMQ CE version
- open the Control Center (URL: localhost:8080 Credential: admin:admin) and go to Clients and load the snapshot
- click on a subscriber to open the client detail and check if they have a shared subscription
Details of my test setup:
- Heap was 8 GB, so the maximum memory for the QoS=0 queue would be 2 GB
- Clients and broker instances were in the same network (low latency)
- Broker machine: 8 CPUs and 16 GB RAM
thanks a lot! That gives me some options to investigate and I will come back with the results after the weekend.
However, I don’t think there is a basic problem with the shared subscription, as the message rates of each of the 12 clients dropped to approx. 1/12th of the original one.
Additionally, I have now split the clients by topic instead of using a shared subscription, i.e. each client subscribes to a distinct subset of all topics. That led to 5 clients consuming between 500 and 2,500 msg/sec each.
But we still observe the problems of missing data (I have set loglevel back to info, so I can’t tell more details).
Cheers and have a nice weekend!
Btw: I think the two screenshots are identical.
Oh yes that was the same image.
Forget my question about whether you use shared subscriptions correctly; the log already states that it is one:
So my next question: What do the topics of your publishers look like? Does each publisher have a unique topic it publishes to?
You also have a nice weekend!
to answer your question:
A publisher is a specific script running on a specific host to collect data. Multiple such scripts are running on each host, each script publishing multiple metrics. And each single metric has a distinct topic, containing identifiers for host, script, metric.
That gives the total of approx. 20k connections to the broker.
Now about the progress with analysing the problem:
First, I wanted to recover the broker back to a normal state without restarting it. What I did was:
- removed all unnecessary subscribers (some monitors were simply counting messages)
- changed the required subscriber (data sink) from a shared subscription (group of 12 clients) to distinct topic subscriptions (9 topic branches distributed across 5 clients)
- disabled debug logging (switched to INFO)
However, the broker did not recover and was still losing messages at roughly the same rate.
On Monday, I finally restarted the broker in the exact same configuration and it ran fine for the rest of the week.
Today, as an attempt to recreate the problem, I started 5 subscriber clients (monitors) on different hosts, which each subscribed to the full topic-set.
After ~3 hours, the first messages appeared in the log, stating that messages had been dropped because the TCP socket was not writable.
So it seems I am at least able to reproduce the issue.
As you suggested, I downloaded the Enterprise version and set it up. However, the log stated that the unlicensed version only accepts 25 clients, which is not enough to reproduce the problem.
So I didn’t continue there.
Another thing I tried was installing the hivemq-influxdb extension in HiveMQ CE, hoping to get some metrics out of the broker. But I couldn’t get it running. The broker was running and I saw no errors whatsoever in the logs, but I simply saw nothing at all about the extension in the logs (and no connection to our InfluxDB). But that’s a different story…
I wish you a merry Christmas!
sorry for the delayed response.
The part with the Enterprise version was just to check whether you had shared subscriptions (which you could have verified in the web interface, the Control Center), but that wasn’t necessary; see my previous comment.
My next piece of advice would have been to set up monitoring for HiveMQ, so I’m happy to see that you are already working on it. Did you by chance install the InfluxDB extension while HiveMQ was running? This is not possible; you’ll need to restart the broker so that the extension is loaded.
Do you use Grafana for displaying the metrics? I could add the dashboard JSON file I used for the pictures I added.
The other thing I wanted to ask: is it possible to add an MQTT client to the broker that also consumes all messages? It would be nice to know whether this also occurs with other MQTT clients (e.g. hivemq-mqtt-client) acting as consumers. Of course it would be better if you could do this in a test environment ;).
Greetings and happy new year,
after having some quieter weeks with the broker, it had a complete outage on Monday. Unfortunately, we lost the logfile (it was >40GB) during the recovery, so I could not identify the original reason. I saw the “QoS0 memory limit exceeded” messages again, but that doesn’t really tell us anything we don’t already know.
However, I have now managed to get the influxdb extension to run. I only now realized that it had disabled itself by placing a DISABLED file into its directory.
Then I only had to solve the trust issue for HTTPS communication with the database and voilà, we receive some metrics now.
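Since a disabled extension produces no errors in the log, a quick check for the marker file can save some searching. The following Python sketch lists extension folders that contain a DISABLED file; the function name is made up for illustration and the path in the usage comment is a hypothetical install location.

```python
from pathlib import Path


def disabled_extensions(extensions_dir):
    """Return the names of extension folders containing a DISABLED marker."""
    root = Path(extensions_dir)
    if not root.is_dir():
        return []
    return sorted(
        marker.parent.name
        for marker in root.glob("*/DISABLED")
        if marker.is_file()
    )


# Usage (hypothetical path, adjust to your installation):
# print(disabled_extensions("/opt/hivemq/extensions"))
```

Removing the DISABLED file from a listed folder is what allows that extension to be started again.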
We are indeed using Grafana for displaying metrics, so if you could provide the JSON for a dashboard, that would save me quite some time.
I will now wait and see whether the metrics give us a hint the next time the problem occurs. If not, I will do as you suggested and replace our client with the hivemq-mqtt-client to see if it behaves in the same way.
Thanks for your help - much appreciated!
unfortunately I can’t add a JSON or ZIP to a comment.
Now I could send it to you via email (for that you would need to PM me your email), or I could use something like Gofile.
Choice is up to you!