Redundancy in beta?

Sammy5 · November 30, 2018, 8:49pm

I have come across an issue trying to enable redundancy, and the thought occurred that this may not be enabled in the beta. So I would like to confirm before spending any more time troubleshooting. I have attempted to set up redundancy on an existing server, and just now on a fresh install.

PGriffith · November 30, 2018, 8:53pm

Redundancy is still getting issues ironed out - I wouldn’t worry about getting it working for now.

ggross · November 30, 2018, 10:37pm

While there are some issues with Redundancy, you should be able to set it up and have a good portion of it work. One thing to note is that Redundancy is now handled via WebSockets and runs over the Gateway Network connections. This occurs on port 8060 and SSL is required.

Configuration items needed are:

Configure > Redundancy: Make sure the Master Gateway is set up as Master, the Backup Gateway is set up as Backup, and the Backup has the Master Node Address set correctly.
Gateway Network > Incoming Connections tab on the Master: Approve the Backup and make sure the connection Status changes to Running. It will go from Running -> Faulted -> Running as the backup is applied.

As noted before, there are some stability items that need work, but a good amount of the functionality is there. Let us know if you can’t get the Gateways to connect.
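A quick way to separate "port blocked" from "port open but not speaking SSL" is a small probe like the sketch below. It only assumes what is stated above (port 8060, SSL required). The function name `probe_gateway_network` and its return strings are made up for illustration and are not part of Ignition. Certificate verification is turned off on purpose, since a test gateway will often have a self-signed certificate.

```python
import socket
import ssl


def probe_gateway_network(host, port=8060, timeout=3.0):
    """Return 'tls-ok', 'tcp-only', or 'unreachable' for host:port.

    Hypothetical helper: it tests reachability and a TLS handshake only,
    it does not speak the Gateway Network protocol itself.
    """
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, filtered by a firewall, or timed out.
        return "unreachable"

    # Skip certificate checks: we only care whether a TLS handshake completes.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.version()
        return "tls-ok"
    except (ssl.SSLError, OSError):
        # TCP connected, but the handshake failed or timed out.
        raw.close()
        return "tcp-only"
```

Run from the backup against the master's address, "unreachable" points at a firewall or a wrong address, while "tcp-only" suggests something is listening but SSL is not enabled on that port.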

Sammy5 · December 3, 2018, 3:07pm

Thank you both for the replies. In my setup I had initially not enabled SSL, or even known about the Gateway Network setup. I did have it somewhat working at first with the activity level set to cold, but once I changed the activity level it never seemed to be able to connect. I will let this ride for now in hopes the redundancy portion gets worked out some more in the beta. It is important for me to see how this will work alongside distributed services with the new Perspective/web clients.

mgross · December 3, 2018, 3:18pm

You should be able to see a gateway network connection running between the master and the backup under the Gateway Network status page. If that is OK, then you should be able to see a connected backup on the Redundancy status page. Can you specify what is on those pages? Especially when it doesn’t seem to be working?

Sammy5 · December 3, 2018, 3:27pm

On the master, under Gateway Network, the backup is “Running” (Incoming), and on the backup the master is “Running” (Outgoing). On the main status page, both servers show “Peer Missing” for redundancy. On the master the Activity Level is Active with Sync status “Good”, but the backup status is “OutOfDate”.

Master=
03Dec2018 07:31:50 Redundancy state changed: Role=Master, Activity level=Active, Project state=Good, History level=Full
03Dec2018 07:31:31 System restore initiated by backup node. System will provide a data-only restore file.
03Dec2018 07:31:20 Redundancy state changed: Role=Master, Activity level=Undecided, Project state=Good, History level=Full
03Dec2018 07:24:49 System restore initiated by backup node. System will provide a data-only restore file.

Backup=
03Dec2018 07:31:31 Initiating a data-only restore.
03Dec2018 07:24:49 Initiating a data-only restore.
03Dec2018 07:24:49 Redundancy state changed: Role=Backup, Activity level=Warm, Project state=OutOfDate, History level=Full
03Dec2018 07:24:37 Ignition[state=STARTING] ContextState = RUNNING
03Dec2018 07:24:20 Redundancy state changed: Role=Backup, Activity level=Warm, Project state=Unknown, History level=Partial
03Dec2018 07:24:20 Ignition[state=STOPPED] ContextState = STARTING
03Dec2018 07:24:12 Ignition[state=RUNNING] ContextState = STOPPING
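To compare the two logs side by side, it can help to pull out just the "Redundancy state changed" lines and put them in chronological order (the excerpts above are newest first). The sketch below assumes the line format is exactly as pasted here; `parse_state_changes` is a hypothetical helper, not an Ignition tool.

```python
import re
from datetime import datetime

MONTHS = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches e.g. "03Dec2018 07:24:49 Redundancy state changed: Role=Backup, ..."
STATE_LINE = re.compile(
    r"^(\d{2})([A-Za-z]{3})(\d{4}) (\d{2}):(\d{2}):(\d{2}) "
    r"Redundancy state changed: (.+)$"
)


def parse_state_changes(log_text):
    """Return [(timestamp, fields_dict), ...] sorted oldest first."""
    events = []
    for line in log_text.splitlines():
        match = STATE_LINE.match(line.strip())
        if not match or match.group(2) not in MONTHS:
            continue  # not a state-change line
        day, mon, year, hh, mm, ss, rest = match.groups()
        when = datetime(int(year), MONTHS[mon], int(day),
                        int(hh), int(mm), int(ss))
        # "Role=Backup, Activity level=Warm" -> {"Role": "Backup", ...}
        fields = dict(part.split("=", 1)
                      for part in rest.split(", ") if "=" in part)
        events.append((when, fields))
    events.sort(key=lambda event: event[0])
    return events
```

Printing `fields["Project state"]` for each event on the backup shows the Unknown -> OutOfDate transition and makes it obvious that no later state change is logged.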

ggross · December 3, 2018, 7:34pm

Is the behavior you are seeing that it is immediately reporting “OutOfDate” or does it look like it connects and then 10-15 minutes later it seems like it isn’t working? We have an open issue because changes on the Master node aren’t allowing the backup to keep up when there are a lot of tag values changing, and the backup keeps having to be restored because it is so far out of date. Does this sound like the issue you are seeing?

Thanks,
Garth

Sammy5 · December 3, 2018, 8:06pm

At the moment these are essentially blank test servers with no incoming tags, so there is no chance the backup is unable to keep up. Both backup servers report “OutOfDate” and stop there. I have enabled debug level logging on most of the Redundancy loggers, but after the initial startup or connection attempts they seem to stop attempting to connect or log events.

ggross · December 3, 2018, 8:17pm

How close are the clocks on the two systems? If you sync them to an NTP server, does the issue resolve itself? I have been trying to reproduce this behavior without success, so I want to understand what I am missing.

Thanks,
Garth

Sammy5 · December 3, 2018, 8:25pm

The servers all use chronyd for time updates.

I can install 2x new servers with the latest build tomorrow from scratch in a VM and see if this issue replicates.

Sammy5 · December 4, 2018, 2:36pm

I set up 2x new VMs with Oracle Linux v7.6. No Java pre-installed, and I downloaded a new version of Ignition 8 this morning. I turned off the firewall and disabled SELinux, made sure SSL was enabled for the gateway, set up redundancy, and finally approved the backup's gateway connection on the master. Both servers showed a Gateway connection from or to the other. However, the system still reports that redundancy is enabled but neither is connected to the other. I have the wrapper logs I can tar up and send if you want to view them. Otherwise, since the redundancy portion is still a work in progress with v8, I am going to hold off on any further work on this for now.

One factor I have not tested is that I am setting these up with activity level warm, whereas I did see this working when set to cold initially. I saved the VM state so I may start over later to test with activity level cold.

mgross · December 4, 2018, 10:31pm

Yeah, the wrapper logs would be handy for both the master and the backup. If you could send them over, that would help.

ggross · December 4, 2018, 11:47pm

I did set up an environment with VMs running Oracle Linux v7.6 and was able to get redundancy working. A couple of things that might be worth checking when setting this up:

Is port 8060 open in the firewall? firewalld is enabled by default and could be blocking connections.
Do the VMs have unique hostnames? Hostnames are a critical part of the gateway network, and we have run into cases where skipping some of the steps needed to change a Linux hostname kept redundancy from working. If you have cloned VMs, this is more likely to happen.

Other than that, the logs that Matt has asked for would be the next step in troubleshooting this issue.
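For the hostname check, a rough sanity test is sketched below. It assumes you have collected the output of `hostname` from each VM yourself; `hostname_problems` and the list of "generic" names are illustrative assumptions, and the check says nothing about whether a hostname change was applied everywhere it needs to be.

```python
# Names that usually mean the hostname was never set (assumed list).
GENERIC_NAMES = {"localhost", "localhost.localdomain"}


def hostname_problems(reported):
    """reported maps a label (e.g. 'master') to the hostname that VM reports.

    Returns a list of human-readable problems; an empty list means the
    names are non-generic and unique (compared case-insensitively).
    """
    problems = []
    seen = {}
    for label, name in reported.items():
        key = name.strip().lower()
        if not key or key in GENERIC_NAMES:
            problems.append(f"{label}: hostname {name!r} is empty or generic")
        if key in seen:
            problems.append(f"{label} and {seen[key]} both report {name!r}")
        else:
            seen[key] = label
    return problems
```

For example, `hostname_problems({"master": "ign81", "backup": "ign81"})` flags the duplicate that a cloned VM would typically produce.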

Sammy5 · December 5, 2018, 4:12pm

The hostnames are unique (ign81 and ign82), and the system names in the gateway settings are also ign81 and ign82. I cleared the logs and did a fresh start of both servers, and then copied the logs over along with some screenshots of my settings. Where and how would you like me to upload these? The zip file is about 1.5 MB.

ggross · December 5, 2018, 7:42pm

You should be able to send me a direct message through the forum with the attachments. You can do this by:

Clicking on my name
Clicking the Message button
Clicking the Upload button in the message area
Attaching the .zip file

If you run into any problems let me know and I will find an alternate method.

ggross · December 10, 2018, 7:32pm

Hi Sammy,

I did get the logs, and sent a follow-up request via Private Message. If you get a chance to look at it, I would appreciate it.

Thanks,
Garth

r.lebohec · February 18, 2019, 4:24pm

Actually, I have exactly the same issue with Ignition version 7.9.9 on our Gateway redundancy pair.
Master Node:

Backup Node:

node clock is synchronized

Sammy5 · February 19, 2019, 4:55pm

On the status page, does the master show all is good, and the backup server report for sync “outofdate”?

Do you have other Ignition servers running besides this pair? The reason I ask is that what I experienced was issues on a network with multiple Ignition servers. When I set up 2 new ones in a different VLAN, the redundancy worked without issue. I can now set up v8 servers on the same subnet as other running Ignition servers, but the backup always reports “out of date”.

ggross · February 27, 2019, 9:59pm

r.lebohec,

If you are experiencing this issue on a 7.x version of Ignition, I recommend contacting support. Being able to talk things through and understand what the current setup is will help prevent you from trying a lot of things that may not be needed.

Sammy5,

On the Master Gateway’s Redundancy status screen, do you see a large number of changes being queued when looking at the Config Updates Queue chart? Does the backup gateway successfully restore a backup? Most of our testing occurs within the same subnet with multiple gateways running, so I am just trying to understand where the failure is occurring.

Garth

aaron1 · May 3, 2019, 2:29am

Hi all,

Just following up on this thread to see if the issues with redundancy in Ignition 8 have been fixed? If not is there an estimated time frame when the updated version will be released? Many thanks.