MQTT Online Status

I searched the CL forum, but the relevant posts go back to 2021 and MQTT TX V4.0.8, and they suggest reliability issues with the Online status boolean in the Node Info folder.

I thought I would start here on the IA forum, to get feedback from people that may have solved what I am looking to achieve.

We are running V8.1.35 on both the server and the edge nodes with IIoT, and MQTT TX V4.0.20.

We want a reliable way of knowing when an edge node is online or offline.

Every day at 9 AM the operations team checks the previous day's history. All our other, non-MQTT devices are Modbus TCP, so I have a script that pings each end device every 5 minutes, logs the result to a DB, feeds a real-time trend on the front end, and generates a PDF report at 9 AM with the devices that had the most "no response" pings at the top.
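To make the report ordering concrete, here is a minimal sketch of the ranking step in Python. The `(device, responded)` row shape and the device names are made up for illustration; the real script reads from the DB.

```python
from collections import Counter

def rank_no_response(ping_log):
    """Return (device, failure_count) pairs, worst offenders first."""
    # Count only the pings that got no response.
    failures = Counter(device for device, responded in ping_log if not responded)
    # Sort by failure count descending, then by name for a stable report order.
    return sorted(failures.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample rows: (device name, did the ping get a response?)
sample_log = [
    ("pump-01", True), ("pump-02", False), ("pump-01", False),
    ("pump-02", False), ("pump-03", True),
]
ranking = rank_no_response(sample_log)
print(ranking)
```

Devices that never failed are left out of the ranking, so the top of the report is always the worst offender.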

We currently have ~110 Modbus TCP devices and 11 MQTT Edge nodes. Every new site will be MQTT.

What do ye other folk do?

So, we have found the Node Info/Online is exactly as reliable as the NDEATH message coming from the MQTT broker.

Our system is 99.9% MQTT and I keep a full log of raw MQTT traffic on a workstation. We have definitely had Sparkplug nodes stuck "online" when they aren't supposed to be, and I've always found from digging through the raw MQTT captures that the NDEATH wasn't sent in those cases.
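If it helps, this is roughly how that kind of check could be scripted against a capture. It is only a sketch: the `(timestamp, topic)` capture format, the function names, and the silence threshold are my own assumptions, and it relies only on the standard Sparkplug B topic layout `spBv1.0/<group>/<message type>/<edge node>[/<device>]`.

```python
def parse_sparkplug_topic(topic):
    """Split spBv1.0/<group>/<type>/<edge_node>[/<device>] into parts, or None."""
    parts = topic.split("/")
    if len(parts) < 4 or parts[0] != "spBv1.0":
        return None
    return parts[1], parts[2], parts[3]

def stuck_online(capture, now, max_silence_s):
    """Nodes whose last lifecycle message was NBIRTH but that have gone quiet."""
    state = {}      # (group, node) -> "online" or "offline"
    last_seen = {}  # (group, node) -> timestamp of the last message from the node
    for ts, topic in capture:
        parsed = parse_sparkplug_topic(topic)
        if parsed is None:
            continue
        group, msg_type, node = parsed
        key = (group, node)
        if msg_type == "NBIRTH":
            state[key] = "online"
        elif msg_type == "NDEATH":
            state[key] = "offline"
        # Only messages published by the node itself count as a sign of life.
        if msg_type in ("NBIRTH", "NDATA", "DBIRTH", "DDATA", "DDEATH"):
            last_seen[key] = ts
    return sorted(
        key for key, s in state.items()
        if s == "online" and now - last_seen[key] > max_silence_s
    )

# Hypothetical capture: (seconds since start, topic)
capture = [
    (0, "spBv1.0/Plant/NBIRTH/edge-a"),
    (10, "spBv1.0/Plant/NBIRTH/edge-b"),
    (500, "spBv1.0/Plant/NDATA/edge-a"),
]
suspects = stuck_online(capture, now=600, max_silence_s=300)
print(suspects)
```

One caveat: Sparkplug nodes report by exception, so a healthy node with nothing changing can be silent for a long time. The threshold only makes sense if the node publishes something regularly.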

The biggest thing to adjust is the MQTT keep alive settings between the edge nodes and the broker. Make sure it's not too large. Smaller is better for quicker recognition that the node is offline, but adds extra traffic. Probably not an issue unless you're on a cellular connection or something. If you're not limited by battery power or metered connections, try going as low as 10-20 seconds.
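To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The 1.5x factor comes from the MQTT spec: the broker drops the connection, and publishes the will (the NDEATH), once it has heard nothing from the client for one and a half times the keep alive. The 120 bytes per ping exchange is my own assumed round figure for PINGREQ/PINGRESP plus TCP/IP headers and ACKs without TLS, so treat the traffic number as an order of magnitude only.

```python
def keep_alive_estimate(keep_alive_s, bytes_per_ping_exchange=120):
    """Estimate worst-case detection time and idle ping traffic per 30-day month."""
    # The broker waits up to 1.5x the keep alive before closing the connection.
    detect_s = 1.5 * keep_alive_s
    # Worst case for traffic: the link is otherwise idle, so there is one
    # PINGREQ/PINGRESP exchange per keep alive interval.
    pings_per_month = 30 * 24 * 3600 / keep_alive_s
    # bytes_per_ping_exchange is an assumption, not a measured value.
    mb_per_month = pings_per_month * bytes_per_ping_exchange / 1_000_000
    return detect_s, mb_per_month

for ka in (10, 20, 60):
    detect_s, mb = keep_alive_estimate(ka)
    print(f"keep alive {ka:>2}s: offline detected within ~{detect_s:.0f}s, "
          f"~{mb:.1f} MB/month of pings")
```

Under those assumptions a 10 second keep alive costs about 31 MB a month per node on an idle link. How that gets billed depends on the carrier.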

On other occasions we have bumped into corner case bugs with our specific broker software. EMQX is fantastic at most things, but has a specific bug where it doesn't bother sending the NDEATH if the node reconnects immediately. Wouldn't be the issue in your case, but I mention it as an example of how you can't totally ignore broker issues as a source of the problem.
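A missed NDEATH on a fast reconnect tends to show up in a capture as two NBIRTHs from the same node with no NDEATH between them. Here is a small sketch of how to look for that, assuming the capture is just an ordered list of topics; the function name and sample data are hypothetical.

```python
def births_without_death(topics):
    """Count, per edge node, NBIRTHs that arrive while the node is already online."""
    online = set()
    missed = {}
    for topic in topics:
        parts = topic.split("/")
        # Sparkplug B topics look like spBv1.0/<group>/<type>/<edge_node>[/<device>]
        if len(parts) < 4 or parts[0] != "spBv1.0":
            continue
        key = (parts[1], parts[3])
        if parts[2] == "NBIRTH":
            if key in online:
                # A second birth with no death in between.
                missed[key] = missed.get(key, 0) + 1
            online.add(key)
        elif parts[2] == "NDEATH":
            online.discard(key)
    return missed

# Hypothetical ordered capture
topics = [
    "spBv1.0/Plant/NBIRTH/edge-a",
    "spBv1.0/Plant/NDATA/edge-a",
    "spBv1.0/Plant/NBIRTH/edge-a",   # reconnect, no NDEATH seen
    "spBv1.0/Plant/NBIRTH/edge-b",
    "spBv1.0/Plant/NDEATH/edge-b",
    "spBv1.0/Plant/NBIRTH/edge-b",   # normal death then birth
]
print(births_without_death(topics))
```

A repeated NBIRTH is not proof of a broker bug on its own, since a node also re-publishes its NBIRTH when the host sends a rebirth request. Treat hits as places to dig into the raw capture.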

Thanks for your insight.

We are about 95% cellular, and the remainder is WAN.

We use a SIM pool with 2 GB per SIM, but a single SIM can use up to 10 GB, with the excess drawn from the rest of the pool.

Is this still the case with EMQX? It looks like this has since been fixed (sometime in late 2023 or early 2024).

I just tested on a newer 5.7.1 server and it appears to be working now. :+1:

I kinda stopped paying attention to my GitHub inbox for a while and missed when they fixed that. I should have noticed when I linked back to it the other week.
