Backup Gateway Active Reasons

Our backup gateway has gone to the active state several times recently. What could be causing this to happen?

I looked into the logs and I don't see anything obvious around the time it went down. CPU and memory usage were normal as well.

Does anyone have suggestions on what to look at to figure out what is causing the backup to go active so often?

The backup will go active if it doesn’t see communication from the master within some threshold time (which is configurable in the backup’s settings).
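In rough terms (this is an illustrative sketch, not the gateway's actual code; the class name, the 10-second timeout, and the promoteToActive() hook are all assumptions), the backup's side of that check is a watchdog on a "last heard from the master" timestamp:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class StandbyWatchdog {
    // Hypothetical stand-in for the configurable failover timeout in the backup's settings.
    private static final long FAILOVER_TIMEOUT_MS = 10_000;

    private final AtomicLong lastHeardFromMaster = new AtomicLong(System.currentTimeMillis());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Called whenever any message arrives from the master node. */
    public void onMasterMessage() {
        lastHeardFromMaster.set(System.currentTimeMillis());
    }

    /** Periodically check how long the master has been silent. */
    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long silence = System.currentTimeMillis() - lastHeardFromMaster.get();
            if (silence > FAILOVER_TIMEOUT_MS) {
                System.out.println("No word from master for " + silence + "ms - going active");
                // promoteToActive();  // hypothetical hook where failover would be triggered
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}
```

The important point is that the decision is made entirely on the backup side, based on silence from the master, not on any explicit "I'm down" message.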

Check for garbage collection events, or something external like a VM snapshot, that could be preventing an otherwise healthy master from responding within the time allotted by the backup. Especially check the ClockDriftDetector logger on the master.

So the communication between them is what determines whether the backup server goes to the active or cold state?

Our primary server shows an uptime of 6 days, but the backup server went active twice last night. So that doesn't mean the primary gateway went down or momentarily restarted, does it?

From the logs:

Clock drift, degraded performance, or pause-the-world detected. Max allowed deviation=1000ms, actual deviation=1693ms

Is that saying the comms between them took 1693ms, and the max is 1000ms before the backup server goes active?

No. That logger can be an indication of an issue, totally separate from redundancy (I'll get back to that in a minute).
Basically, every 1 second of wall-clock time, the gateway checks how long it's been since the last time it checked. If that delta is outside a reasonable threshold (tens of milliseconds, to allow for standard garbage collection pauses), a warning is logged. No part of this is specific to redundancy.
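As a sketch (illustrative only; the class and constant names are assumptions, with the 1000ms threshold taken from the log message you quoted rather than from the prose above), the check boils down to a once-per-second task comparing the actual elapsed time against the expected interval:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ClockDriftCheck {
    private static final long INTERVAL_MS = 1_000;       // check once per second
    private static final long MAX_DEVIATION_MS = 1_000;  // allowed deviation before warning

    private long lastCheck;

    public void start() {
        lastCheck = System.currentTimeMillis();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            // How far the actual gap between checks strayed from the expected 1 second.
            long deviation = (now - lastCheck) - INTERVAL_MS;
            lastCheck = now;
            if (Math.abs(deviation) > MAX_DEVIATION_MS) {
                System.out.println("Clock drift, degraded performance, or pause-the-world detected."
                        + " Max allowed deviation=" + MAX_DEVIATION_MS
                        + "ms, actual deviation=" + Math.abs(deviation) + "ms");
            }
        }, INTERVAL_MS, INTERVAL_MS, TimeUnit.MILLISECONDS);
    }
}
```

A long GC pause, a time sync, or a paused VM all show up the same way here: one scheduled run arrives much later than expected, so the measured gap far exceeds the interval.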
The three most common issues that the ClockDriftDetector helps to...detect are:

  1. Excessively long GC pauses, indicating the JVM is struggling to stay within its max memory confines. These events will usually be a few seconds long, and come in more and more frequently until the JVM eventually collapses. (There's a small GC-sampling sketch after this list.)
  2. Gateway time synchronization events. Due to the way it's written, the ClockDriftDetector will also notice if the operating system "changes" time out from underneath the gateway - you'll get a single clock drift event of hundreds of thousands of milliseconds, as the timezone changes plus or minus several hours, or something along those lines.
  3. VM interruptions. Things like nightly IT snapshots of VMs or equivalent will often 'pause' the entire system for several seconds at a consistent time. The gateway will detect this as a single gap - from the JVM's perspective, it's got 'missing' time for however long the snapshot took.
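For item 1, one way to check whether long collections line up with the clock drift warnings (besides enabling the JVM's GC logging) is to poll the standard GarbageCollectorMXBeans and watch how much collection time accumulates between samples. This is a generic JVM sketch, not anything gateway-specific:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

public class GcSampler {
    private final Map<String, Long> lastCollectionTime = new HashMap<>();

    /** Print how much time each collector spent collecting since the previous call. */
    public void sample() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long total = gc.getCollectionTime();  // cumulative ms spent in this collector
            long previous = lastCollectionTime.getOrDefault(gc.getName(), 0L);
            lastCollectionTime.put(gc.getName(), total);
            long delta = total - previous;
            if (delta > 0) {
                System.out.println(gc.getName() + " spent " + delta
                        + "ms collecting since the last sample");
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        GcSampler sampler = new GcSampler();
        while (true) {
            sampler.sample();
            Thread.sleep(5_000);  // sample every 5 seconds
        }
    }
}
```

If the sampled collection times spike at the same moments the ClockDriftDetector warnings appear, memory pressure is the likely culprit.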

So, if you have lots of 'short' clock drift detection logs, it most often indicates memory pressure on the gateway. That could be what's preventing the backup from seeing the master in time, prompting it to go active. From this alone I can't say for sure (and you really should get in contact with support about this).

I'm not implying it did. Think about it in isolation - how would a backup gateway reliably detect that a master wasn't present? You can't expect the master to send a 'hey, take over now' message, because the master could cease to exist - in the event of a power failure, meteor strike, etc. So the only possible indicator that the backup should go active is the backup itself detecting that the master is no longer up. If the master is experiencing long clock drift due to memory pressure, then it will be 'paused' through a periodic interval when it would otherwise have sent the backup a handshake to acknowledge its existence. If you have lots of GC events, then sheer bad luck could have those events coincide with the handshake interval between the master and backup.

Thank you for your help. I am planning on getting hold of someone from support; I just haven't had the chance yet. I will point them to this forum thread for background on what we've discussed here.
