Redundancy Incompatible Platform

The 8.1.17 changelogs indicated that redundancy would now tolerate version differences:

Platform - Redundancy
Enterprise
Failover to the other redundant node is now allowed if the nodes have different platform versions, which will allow attached clients to remain connected to at least one node during a redundant pair upgrade.

This doesn’t appear to work for me.

Master 8.1.17:

Backup 8.1.18:

Is there something needed to allow this version mismatch tolerance?

IIRC, you have to upgrade the master first.

EAM disagrees:

Huh. I stand corrected. (Though I don’t deal with EAM for my clients. I’ve always upgraded the master first, while the backup carried the load.)

The docs do differ in opinion between a normal upgrade and an EAM upgrade.

Either way, I would expect the pair to negotiate a connection after this change. Else, what was the change about?

If it works as expected, it would allow you to force a failover to backup and upgrade the backup. Once done, it should allow you to assume control back to the master, eliminating the hope and pray that it takes over when the backup goes down for upgrade, as is currently the case.

I was just about to upgrade and was hoping this was fixed as well. EAM upgrades have been near useless with redundancy for the past few versions, so I have reverted to manual updates.

It still works; it was completely broken in a previous version.

It would be a real advantage to have a controlled failover in both directions before and during upgrade, as the change suggests should be possible.

This is probably just the release notes being a bit unclear. The Incompatible (Platform) text on the redundancy page is still expected when the gateways are different versions (this also disables the manual failover and sync buttons). Previously, the gateways would reject their connection when they were incompatible, which meant that Vision and Perspective clients wouldn’t fail over to the upgraded backup when the master was down for its upgrade. With the changes, the gateways now connect in “incompatible mode”, which allows Vision and Perspective clients to fail over to the backup during the master upgrade and then back to the master when it comes back online upgraded. Does that clear it up a bit? Or were you hitting a different issue here?

I’m confused, as the behaviour you’re describing has been working largely that way since 8.0.x.

I’ve tested upgrades using the master and the backup first, going with and against the documentation advice, and my procedures have been as follows:

Master upgrade first:

  1. Force failover to Backup
  2. Upgrade Master
  3. Warns about the incompatible versions
  4. Upgrade Backup
  5. Master assumes control
  6. Redundancy syncs up and everything is happy

Backup upgrade first:

  1. Upgrade Backup
  2. Master complains about incompatible versions
  3. Upgrade Master
  4. Backup becomes active when Master goes down
  5. Master automatically assumes control when back up
  6. Redundancy syncs up and everything is happy

How does this change improve things?

So looking into the issue a bit more, it stemmed from a customer issue where they upgraded the master gateway first. The backup lost connection to the master during the upgrade, became active, and stayed active even after the master was back online, since the platform was incompatible and the connection would therefore get refused. This caused them to have a ton of duplicated records from their transaction groups.
The changes here made it so that even if the platform versions were incompatible, the redundant pair could still connect, so that the master could make the backup go inactive. This also had the side effect of helping clients go back and forth appropriately, as the redundant pair would actually be in the correct active or inactive state since the nodes could communicate with each other even if they were platform incompatible.
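
The difference described above can be sketched as a toy model. This is not Ignition code: the function names and state labels are hypothetical and only mirror the reasoning in this post, assuming the master is active after its restart.

```python
def backup_state_after_master_returns(master_version, backup_version, tolerant):
    """Toy model of the backup's state once the upgraded master is back online.

    tolerant=False models the pre-8.1.17 behaviour described above (the
    connection is refused when platform versions differ). tolerant=True models
    the 8.1.17+ behaviour (the pair still connects in "incompatible mode").
    All names here are illustrative, not part of any Ignition API.
    """
    versions_match = master_version == backup_version
    connected = versions_match or tolerant
    # The backup went active while the master was down for its upgrade.
    # Only a connected master can tell it to go inactive again.
    return "inactive" if connected else "active"


def is_split_brain(master_version, backup_version, tolerant):
    """True when both nodes end up active at the same time."""
    # Assumption: the master is active after its restart, so the pair is in
    # split brain whenever the backup also stays active.
    state = backup_state_after_master_returns(
        master_version, backup_version, tolerant)
    return state == "active"
```

With `tolerant=False` and mismatched versions, the model reports split brain, which corresponds to the duplicated transaction group records mentioned above.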

Ok, so bottom line, should the EAM upgrade work according to the docs?

AKA - Backup upgrade first:

  1. Upgrade Backup
  2. Master complains about incompatible versions
  3. Upgrade Master
  4. Backup becomes active when Master goes down
     • Clients should auto-transfer to backup at this point
  5. Master automatically assumes control when back up
  6. Redundancy syncs up and everything is happy
     • Clients should auto-transfer to master at this point

Yes, it should work like it says in the docs. It’s worth noting that the changes from 8.1.17 have to already exist on both the master and backup nodes, so it’s more of an “it’ll work going forward from 8.1.17” situation. If you are upgrading from a version lower than that, you will likely run into the previous issues.
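
The prerequisite above can be checked with a small helper before starting an upgrade. This is a minimal sketch, assuming plain `major.minor.patch` version strings; `parse_version` and `mismatch_tolerated` are hypothetical names, not Ignition functions.

```python
# First version that contains the mismatch-tolerance change, per this thread.
MIN_TOLERANT = (8, 1, 17)


def parse_version(text):
    """Turn '8.1.17' into (8, 1, 17) so versions compare numerically.

    Comparing the raw strings would be wrong: '8.1.9' sorts after '8.1.17'.
    """
    return tuple(int(part) for part in text.split("."))


def mismatch_tolerated(master_version, backup_version):
    """True if both nodes already run 8.1.17 or later."""
    return all(
        parse_version(version) >= MIN_TOLERANT
        for version in (master_version, backup_version)
    )
```

For example, `mismatch_tolerated("8.1.17", "8.1.18")` is true, while a pair in the middle of an 8.1.15 to 8.1.16 upgrade returns false and would be expected to hit the older behaviour.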

So today I performed an upgrade in one of our redundant environments and noticed some difference in behaviour.

Pre v8.1.17, our normal process when upgrading would be to set recovery mode on the master to manual, force failover to the backup, and then update the primary. Once the primary comes back online, we would give it a period of time to stabilise and then manually initiate the master to take active control.

Seems that with this last update this morning from v8.1.17 to v8.1.18, I was not able to force the master to take over again after the update (as the buttons are now hidden by the incompatible platform info). Rather than adhering to the manual recovery mode setting, it took active control itself as soon as the “Startup Connection Allowance” value of 600s expired.

In this case it wasn’t quite enough time for the tag subscriptions to other gateways to finish doing what was needed, which resulted in a large flood of alarms.

Is this now the expected behaviour for v8.1.17+ ?
For now, I have changed the time to 1800s to ensure enough time for start-up in future, but unfortunately I seem to have no ability to manually trigger it to take back control and will just need to wait for the timer to expire.

I just gave this a shot, looking at the differences between 8.1.15 → 8.1.16 and 8.1.17 → 8.1.18. What I saw was that in both cases the Master would go from the “undecided/inactive” state to “Active” once the “Startup Connection Allowance” value expired, which is consistent with our documentation on that setting: it mentions that the Master will always request the backup to de-activate once that time has elapsed, regardless of manual recovery mode.

The big difference was that in 8.1.15 → 8.1.16 when the master upgraded and became active, it did not cause the backup to become inactive. This causes the “split brain” issue that we were solving with the changes in 8.1.17 where both nodes remain active when they shouldn’t be. This probably gave the appearance previously that the master did not become active since the backup would remain active while in incompatible versions before the changes in 8.1.17.

Long story short, your change to make the “Startup Connection Allowance” longer was the correct way to handle your specific situation; the rest was expected functionality. However, I will make a feature request to have the “Assume Control” button return when the gateways are incompatible and the master is currently not active because of manual recovery mode.
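
The timing described above can be sketched as a toy model. It is not Ignition code: the function names are hypothetical, and the 900s readiness figure in the usage note is a made-up example value, not something measured in this thread.

```python
def master_state(seconds_since_startup, startup_allowance_s, manual_recovery=True):
    """Toy model of the upgraded master while the pair is version-mismatched.

    manual_recovery is deliberately unused: per the post above, the master
    asks the backup to de-activate once the Startup Connection Allowance
    expires, regardless of the recovery mode setting.
    """
    if seconds_since_startup >= startup_allowance_s:
        return "active"
    return "undecided/inactive"


def takes_over_too_early(startup_allowance_s, subscriptions_ready_s):
    """True if the master would go active before tag subscriptions settle."""
    return startup_allowance_s < subscriptions_ready_s
```

If tag subscriptions needed, say, 900s to settle, `takes_over_too_early(600, 900)` is true (the alarm flood case), while `takes_over_too_early(1800, 900)` is false, which is why lengthening the allowance helps.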

So I have done a bit more testing on this and had a few more issues/questions. Specifically:

  • When upgrading the version on the master gateway whilst the backup gateway was still active, after a restart we ended up in an odd “unusable” state whereby the Master was immediately set as active and the backup as warm, even though the active startup timer of 30 minutes had not completed. This resulted in both gateways being unable to load Perspective sessions until the 30 minutes had completed.
  • Secondly, is it possible for the mismatched version gateways to sync, or will we be forced to accept a discrepancy (and ultimately alarm retriggers) when upgrading? To avoid alarm retriggers, we require the tag subscriptions to occur prior to the master becoming active. In this version mismatch it does not appear to subscribe to the tags again (presumably because it is unable to sync with the backup?).

During standard restarts without version upgrades, we are usually able to manually fail over, restart the master, and then, once tags are subscribed and healthy, assume control. This also avoids any retrigger of alarms. Keen to know if this is possible to achieve when upgrading versions as well.

Andre

Sorry for the late response, I’ve been giving these a try lately and just haven’t been able to reproduce what you have been seeing. I think we’d need support eyes on your system to really get a good idea as to what’s going on.
I will mention that during a version mismatch, syncing does not take place, as the connection in this instance basically only allows the master to disable the backup. However, as of 8.1.19 with change 2393, a lot of the duplicate alarm and tag status behaviour should be a bit better.