Redundancy failover

Hi everyone,

What are the conditions for a redundant pair of gateways to automatically fail over?
Is there a full list of these conditions in the documentation?

I believe this is what you are looking for:
Ignition Docs: Setting up Redundancy

From the documentation I understand that the backup assumes responsibility only if a failover timeout occurs:

Are there any other cases?

Manually triggered failover.

So besides the loss of network connection between the gateways, there is no other case where the failover is automatically triggered?

No. But keep in mind that the network check is traffic between the gateways over their connection. A gateway crash or freeze on either side would break the "connection". It isn't just a simple ping.

What are you hoping the answer is?

I thought there might be automatic monitoring of certain key parameters of the active gateway, such as the CPU and RAM used by the active gateway. If those values exceeded a certain limit for a certain time, then a failover could occur.
But if the failover only occurs after a loss of the network connection between the gateways (no matter the cause of the loss), the answer is fine, I just needed a confirmation.

Redundancy failover is not intended to be a load-balancing mechanism. It's all-or-nothing. If one server is overloaded, failover will just send all that load to the backup.

True load balancing is possible in Ignition, but the exact approach will depend on what you're trying to load balance - e.g. Vision is different from Perspective is different from tag execution load.
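
If you do want visibility into those parameters, a gateway timer script can read the gateway's performance system tags and log when they stay high (monitoring only; it won't trigger a failover). A rough sketch, assuming the usual [System]Gateway performance tag paths; check the Tag Browser on your version for the exact names:

# Gateway timer script sketch: monitoring only, this does NOT trigger a failover.
# The tag paths below are assumptions; verify them under [System]Gateway in the Tag Browser.
paths = [
    "[System]Gateway/Performance/CPU Usage",     # assumed: 0.0 to 1.0
    "[System]Gateway/Performance/Memory Usage",  # assumed: heap bytes in use
    "[System]Gateway/Performance/Max Memory",    # assumed: heap ceiling in bytes
]
cpu, memUsed, memMax = [qv.value for qv in system.tag.readBlocking(paths)]

logger = system.util.getLogger("RedundancyHealth")
if cpu > 0.85:
    logger.warn("Sustained CPU load on active gateway: %d%%" % int(cpu * 100))
if memMax and float(memUsed) / memMax > 0.85:
    logger.warn("Heap usage above 85% of max")

If you want trends rather than log entries, write the same values to memory tags and let the historian record them.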

We have a gateway that has been briefly failing over twice a day. Looking at the redundancy settings, I noticed the Ping Timeout is 300 and the Ping Max Missed is 10. These are probably the default settings; I didn't notice that in the manual. This is only a 3 second window. Is it reasonable to increase these to a 30 second window? I know it's not the root cause, but it might help us isolate the problem and ride out whatever glitch is causing it. It always restores almost immediately. We recently upgraded to 8.3, but this was also happening with 8.1.36. We have a lot of tags (~0.5 million), MQTT, etc. It looks like our memory is sufficient. Just looking for ideas.

The ping timeout setting is in milliseconds, so 3/10 of a second. (The default.)
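
With the defaults, that works out to roughly 300 ms × 10 missed pings ≈ 3 seconds, which matches the window you described.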

Having repeated 300ms ping delays suggests a few possibilities:

  • Overloaded system at the OS level slowing replies from the partner,
  • Overloaded system at the java level slowing processing of replies (java GC stalls, perhaps),
  • Overloaded or overcommitted hypervisor(s) starving one or the other redundant gateway,
  • Flaky network link dropping packets (perhaps poor Rapid Spanning Tree config, or plain Spanning Tree instead of RSTP),
  • Flaky VPN for a backup gateway that isn't located in the same facility as the master (don't do this!).

(With a really big system, I'd suspect GC stalls first.)

Maybe I buried the lede, but I was really asking if it is reasonable to increase the ping timeout temporarily to help us troubleshoot. Correct me if I'm wrong, but I was under the impression that if an event pauses the primary gateway for more than 3,000 milliseconds, it will trigger a failover. I was wondering if it would be reasonable to increase these temporarily to help find the root cause, since the failover itself causes all kinds of issues with MQTT etc.
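
For example, if the window really is roughly Ping Timeout × Ping Max Missed (as the 300 × 10 ≈ 3 s numbers suggest), then Ping Timeout = 3000 with Ping Max Missed = 10, or 1000 with 30, should stretch it to about 30 seconds, purely as a temporary diagnostic setting.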

I'm struggling to imagine how this would help find the problem.

That it mitigates the pain is indisputable, but I would venture that the pain may be necessary to pinpoint the problem.

I certainly would not increase the ping timeout until after checking off the list I posted.

Thanks. From other posts here, we will probably start by adding this to ignition.conf:
wrapper.java.additional.XX=-XX:MaxGCPauseMillis=200

Why?

Also, 200 is the default value for that setting.

Yeah, good point, we found that from your post. Honestly, because I had two LLMs analyze the system_logs.idb and that was one of the recommendations, but we haven't changed anything yet.

This is probably more quickly useful, so you can see your stalls, if any:

wrapper.java.additional.9=-Xlog:gc:file=logs/gc.log:t,tags:filecount=5,filesize=4M

(MaxGCPauseMillis is a soft target, not a hard limit.)
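
(That -Xlog:gc syntax is JDK 9+ unified logging. The path is relative to the JVM's working directory, which for a standard install should put gc.log next to the wrapper logs, and filecount=5,filesize=4M rotates the files, so it's safe to leave in place. Look for pause entries that approach your three-second failover window.)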