We have a ticket in progress to fix this. The workaround in the meantime is to reset the backup's gateway network connection on the master gateway, and the connection should go back to normal.
I’m facing another problem, and I’m not sure what else I could try.
Here’s my current setup:
2 Gateways, both running as Docker containers
A Traefik reverse proxy in front of them
I’ve already tried many different configurations (a rough compose sketch of the general layout follows the list):
1. Clients via HTTPS with SSL offloading on Traefik → Gateway port 8088; the gateways connect to each other via SSL on port 8060
2. Clients via HTTPS with SSL offloading on Traefik → Gateway port 8043; the gateways use a self-signed certificate and connect to each other via SSL on port 8060
3. Clients via HTTPS with SSL offloading on Traefik → Gateway port 8088; the gateways connect to each other via port 8088 without SSL
4. Clients via HTTPS with SSL offloading on Traefik → Gateway port 8043; the gateways use a self-signed certificate and connect to each other via port 8088 without SSL
5. Clients connect directly to Gateway port 8043; the gateways use a self-signed certificate and connect to each other via port 8088 without SSL
6. Clients connect directly to Gateway port 8043; the gateways use a self-signed certificate and connect to each other via SSL on port 8060
7. Clients connect directly to Gateway port 8088; the gateways connect to each other via port 8088 without SSL
8. Clients connect directly to Gateway port 8088; the gateways connect to each other via SSL on port 8060
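For reference, this is roughly what the compose layout looks like. It is only a minimal sketch of the structure, not my real stack: the image tag, hostnames, and router names are placeholders, and all Ignition-specific settings are left out.

# Sketch only: Traefik terminates HTTPS and forwards to the gateways on port 8088;
# the two gateways talk to each other directly over the Docker network (8060 SSL or 8088 plain).
services:
  traefik:
    image: traefik:v2.10
    command:
      - --providers.docker=true
      - --entrypoints.websecure.address=:443
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  gateway-master:
    image: inductiveautomation/ignition:8.1    # placeholder tag
    labels:
      - traefik.enable=true
      - traefik.http.routers.gw-master.rule=Host(`master.example.local`)   # placeholder host
      - traefik.http.routers.gw-master.entrypoints=websecure
      - traefik.http.routers.gw-master.tls=true
      - traefik.http.services.gw-master.loadbalancer.server.port=8088

  gateway-backup:
    image: inductiveautomation/ignition:8.1    # placeholder tag
    labels:
      - traefik.enable=true
      - traefik.http.routers.gw-backup.rule=Host(`backup.example.local`)   # placeholder host
      - traefik.http.routers.gw-backup.entrypoints=websecure
      - traefik.http.routers.gw-backup.tls=true
      - traefik.http.services.gw-backup.loadbalancer.server.port=8088

For the scenarios where the clients connect directly to the gateways, ports 8088/8043 are published on the gateway containers as well.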
In almost all of these scenarios, redundancy works as expected. The manual switchover by pressing the button on the gateway also works correctly.
However, the automatic switchover in the Perspective session only works if both clients and gateways connect directly to port 8088.
I have no idea what could be causing this issue. Manual switchover, synchronization, and even the automatic failback (when the master comes back online) all work perfectly. It’s just the automatic switchover in the Perspective session that fails in every other configuration.
It looks like there is a separate ticket in progress to fix redundancy failover for Perspective sessions. From the description on the ticket, it sounds similar to the situation that you have described.
I saw that the bug should be fixed in the nightly build.
I have now tried it: without the reverse proxy it works for me; with the reverse proxy it doesn’t.
I have an idea of what the problem could be.
The difference between running with and without the reverse proxy is that with the reverse proxy the hello request returns an HTTP 502 error, while without the reverse proxy the hello request times out.
So my guess is that the session only redirects to the backup if the hello request times out. Could that be the case?
onHelloRejected(e) {
var t, o;
const n =
null === (t = null == e ? void 0 : e.response) || void 0 === t
? void 0
: t.status;
(x.error(
() =>
`Hello API call failed. Code=${null == e ? void 0 : e.code}. Status=${null == e ? void 0 : e.status}. Message=${null == e ? void 0 : e.message}`,
),
n
? 404 === n
? this.transition(O.ClientActions.NO_PROJECT)
: (null === (o = this.idle) || void 0 === o
? void 0
: o.maybeTriggerIdleTimeoutAction()) ||
this.scheduleNextHelloCheck()
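      // no status code at all (no response from the gateway) → try the backup peer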
: this.maybeRedirectToPeer());
}
If I understand the client script in the Perspective session correctly, the client only redirects if the master doesn’t send any status code at all. But if the gateway is behind a Traefik reverse proxy, that doesn’t work.
The scenario is that the gateway’s Docker container stops working while the Traefik reverse proxy keeps running. The reverse proxy then answers with a 502 or 404, depending on the configuration.
Maybe this should be changed so that the session also switches over if the status code is something like 502?
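To make it concrete, this is roughly the change I have in mind. It is only a sketch based on my reading of the de-minified handler above, not actual Perspective code, and the list of status codes is just my assumption:

onHelloRejected(e) {
    // Sketch only: same branches as the quoted handler, plus proxy "backend down"
    // statuses treated like a missing response.
    const status = e?.response?.status;
    const proxyDownStatuses = [502, 503, 504]; // assumption: codes a proxy returns when the gateway container is gone

    x.error(() => `Hello API call failed. Code=${e?.code}. Status=${e?.status}. Message=${e?.message}`);

    if (!status || proxyDownStatuses.includes(status)) {
        // No response at all, or the proxy reports the backend as unreachable:
        // attempt the redirect to the redundant peer.
        this.maybeRedirectToPeer();
    } else if (status === 404) {
        this.transition(O.ClientActions.NO_PROJECT);
    } else {
        this.idle?.maybeTriggerIdleTimeoutAction() || this.scheduleNextHelloCheck();
    }
}

That way, a fast 502/503/504 from the still-running proxy would be treated the same as getting no answer at all.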
I don’t have a good answer on how to deal with the reverse proxy in this situation, but I am asking around on our side to see if anyone else has any ideas.
When I trigger the redundancy switchover via the reverse proxy, the network trace shows that the hello request is always sent, but the still-active reverse proxy responds with 502 Bad Gateway. As mentioned earlier, the Perspective error handling only redirects to the backup if the request receives no response.
In the second screenshot you can see a working redirect. In this test I accessed the master directly (bypassing the reverse proxy). After I stopped the master, the redirect happened immediately. In this case, the hello request to the master returns a network error with no status code.
Interpretation: Behind the reverse proxy, a fast HTTP 502 is treated as a (valid) response, so the failover doesn’t trigger. When hitting the master directly, the lack of any response triggers the failover as expected.
Yes, your detective work is correct. A return of no status, or a status code of 0, when we ping the Gateway is the only thing that will attempt a redirect. We have no ticket for this, which makes me think it isn't a problem people run into very often. Perhaps it's more common for the entire host server to go down? Or they're using a load balancer with multiple front-ends in parallel. Or some have figured out how to map a 502 to a 0 in the proxy config (mod_rewrite if using an Apache server)? That's what I would suggest while you wait for us to make this configurable.
With Traefik as the reverse proxy, I don’t think that kind of status rewrite is possible. Maybe you’re right that in most cases the whole host goes down, but I also believe that if the documentation describes how to use Ignition with a reverse proxy and Docker containers, then redundancy should also work in that setup, even if only the Ignition container stops working.
Is it planned on the development timeline to make this configurable? Does it make sense to file a feature request? If so, where can I do that?