Incorrect redundant gateway routing - fixed in 7.9.14

driverk · April 30, 2020, 6:00pm

In the 7.9.14 release notes:

Fixed an issue where routing to redundant remote gateways can end up targeting the incorrect gateway based on disconnect/reconnect timing.

Is this issue long-standing in the 7.9.x line? I am asking because we may be seeing a related problem in 7.9.6. In some odd cases, redundancy detects a problem with the primary node, switches to the backup, then quickly switches back to the primary.

This leads to a situation where we have to reset some components such as tag providers in order to resume normal operation.

salbrechtsen · May 4, 2020, 4:19pm

I’m wondering the same thing. We’ve seen redundancy toggling as well on 7.9.9.

PGriffith · May 4, 2020, 4:39pm

Yes, it’s a longstanding issue that affects basically the entire 7.9 and 8.0 line(s) without the fix.

driverk · May 4, 2020, 4:44pm

Thanks, PGriffith! We will make it a priority to qualify this release for deployment.

salbrechtsen · May 4, 2020, 4:49pm

Thanks!

Can you share some more details as to what the issue is? Is there a workaround for those of us that haven’t upgraded yet?

PGriffith · May 4, 2020, 8:19pm

I don't think so, unfortunately; it's low-level internal code. From the fix:

Here was the heart of the problem:

Routes to redundant pairs (that are directly connected) are managed in a special way

The normal routing logic is/was interfering with that (which isn't exactly bad, except...)

When a disconnect occurred, the "redundancy refresh" command was timing out. If the connection was re-established before that timeout, normal routing would disconnect the route to master, but the refresh (running after the timeout) would think the master was still good and would not re-establish the route.
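To make the timing concrete, here is a rough toy model of that race in Python. It is not Ignition's code: the function name, the `rebuild_pending` flag, and especially the behaviour when the reconnect happens after the timeout are assumptions made purely for illustration. It only shows how the order of the two events decides whether a route to the master is left in place.

```python
def route_survives(reconnect_at: float, refresh_timeout_at: float) -> bool:
    """Return True if a route to the master exists after both events fire.

    Toy model only. All names here are hypothetical, and the branch where
    the refresh times out before the reconnect is an assumed "healthy" path.
    """
    connected = False        # t = 0: the link to the master has just dropped
    route_to_master = True   # the route existed before the disconnect
    rebuild_pending = False  # assumed flag: refresh decided the master is bad

    # Process the two events in time order.
    events = sorted([(reconnect_at, "reconnect"), (refresh_timeout_at, "refresh")])
    for _, kind in events:
        if kind == "reconnect":
            connected = True
            # Normal routing logic drops the route to the master.
            route_to_master = False
            # Assumed: a rebuild requested earlier by the refresh is honoured.
            if rebuild_pending:
                route_to_master = True
                rebuild_pending = False
        else:
            # The refresh only requests a rebuild if the master looks bad.
            if not connected:
                rebuild_pending = True
            # Otherwise the master looks good, so the dropped route is
            # never re-established.
    return route_to_master


# Reconnect lands before the refresh timeout: the route is lost.
print(route_survives(reconnect_at=1.0, refresh_timeout_at=5.0))   # False
# Reconnect lands after the refresh timeout: the route comes back.
print(route_survives(reconnect_at=10.0, refresh_timeout_at=5.0))  # True
```

In this sketch the bad outcome only appears when the reconnect beats the timeout, which matches the "disconnect/reconnect timing" wording in the release note.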