No. That logger can be an indication of an issue, totally separate from redundancy (I’ll get back to that in a minute).
Basically, once per second (wall clock), the gateway checks how long it’s been since the last time it checked. If that delta is off from the expected one second by more than a reasonable threshold (tens of milliseconds, to allow for standard garbage collection pauses), then a warning is logged. No part of this is specific to redundancy.
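To make that concrete, here is a minimal sketch of the idea in Python. It is not the gateway's actual code: the function names, the 50 ms threshold, and the log message are all assumptions made up for illustration.

```python
import time

EXPECTED_INTERVAL_MS = 1000  # one check per second, as described above
THRESHOLD_MS = 50            # assumed value for "tens of milliseconds"


def drift_ms(previous_ms, current_ms, expected_ms=EXPECTED_INTERVAL_MS):
    """How far the observed gap between two checks deviates from the expected gap."""
    return (current_ms - previous_ms) - expected_ms


def is_drift(deviation_ms, threshold_ms=THRESHOLD_MS):
    """True if the deviation is large enough to be worth a warning."""
    return abs(deviation_ms) > threshold_ms


def run_detector(checks, interval_s=1.0, clock=time.time, sleep=time.sleep, warn=print):
    """Hypothetical loop: sleep, read the wall clock, compare against the last reading."""
    # time.time is a wall clock, so OS-level time changes show up as drift too.
    last_ms = clock() * 1000
    for _ in range(checks):
        sleep(interval_s)
        now_ms = clock() * 1000
        deviation = drift_ms(last_ms, now_ms, interval_s * 1000)
        if is_drift(deviation):
            warn(f"Clock drift detected: deviation of {deviation:.0f} ms")
        last_ms = now_ms
```

Anything that stops the loop from running on schedule (a GC pause, a frozen VM) or that moves the wall clock shows up as a deviation, which is why the warning says nothing about the cause on its own.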
The three most common issues that the ClockDriftDetector helps to… detect are:
- Excessively long GC pauses, indicating the JVM is struggling to stay within its max memory confines. These events will usually be a few seconds long, and come in more and more frequently until the JVM eventually collapses.
- Gateway time synchronization events. Due to the way it’s written, the ClockDriftDetector will also notice if the operating system “changes” time out from underneath the gateway - you’ll get a single clock drift event of hundreds of thousands of milliseconds or more (an hour is 3,600,000 ms), as the timezone changes plus or minus several hours, or something along those lines.
- VM interruptions. Things like nightly IT snapshots of VMs or equivalent will often ‘pause’ the entire system for several seconds at a consistent time. The gateway will detect this as a single gap - from the JVM’s perspective, it’s got ‘missing’ time for however long the snapshot took.
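As a rough way to tell these three patterns apart, you could bucket the drift values pulled from the logs. This is only a heuristic sketch: the function name, the cutoffs (100,000 ms for "huge", five events for "lots"), and the labels are assumptions, not anything the gateway does itself.

```python
def classify_drift_events(drifts_ms, huge_ms=100_000, many=5):
    """Guess the likely cause from a list of logged drift values in milliseconds.

    All thresholds are illustrative assumptions; tune them against real logs.
    """
    if not drifts_ms:
        return "no drift logged"
    # A jump of hundreds of thousands of ms is far too big for a GC pause.
    if any(abs(d) >= huge_ms for d in drifts_ms):
        return "system time changed"
    # Lots of shorter events points toward garbage collection trouble.
    if len(drifts_ms) >= many:
        return "possible memory pressure"
    # A lone gap of a few seconds looks more like the whole system being paused.
    return "possible VM or system pause"
```

For example, `classify_drift_events([2100, 3400, 2800, 4100, 3900, 5200])` lands in the memory pressure bucket, while a single `[8000]` lands in the pause bucket. Checking whether isolated gaps recur at the same time each night is left out here, but it would be the obvious next refinement.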
So, if you have lots of ‘short’ clock drift detection logs, it most often indicates memory pressure on the gateway. That could be the problem that’s causing the backup to not see the master in time before it attempts to go active. In isolation, I can’t say for sure (and you really should get in contact with support about this).
I’m not implying it did. Think about it in isolation - how would a backup gateway reliably detect that a master wasn’t present? You can’t expect the master to send a ‘hey, take over now’ message, because the master could simply cease to exist - in the event of a power failure, meteor strike, etc. So the only possible indicator that the backup should go active is the backup detecting that the master is no longer up. If the master is experiencing long clock drift due to memory pressure, then it can be ‘paused’ at the moment it would otherwise have sent the backup its periodic handshake to acknowledge its existence. If you have lots of GC events, then sheer bad luck could have one of those events coincide with the handshake between the master and backup.
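Here is a toy simulation of that failure mode. It is not how the redundancy system is actually implemented: the one second handshake interval, the three second timeout, and both function names are hypothetical, chosen only to show how a frozen master looks identical to a dead one from the backup's side.

```python
def handshake_times(interval_s, duration_s, pauses):
    """Times at which the master's handshakes actually go out.

    pauses is a list of (start_s, length_s) windows during which the master is
    frozen (e.g. a long GC). A handshake due inside a pause is sent when it ends.
    """
    sent = []
    due = interval_s
    while due <= duration_s:
        actual = due
        for start, length in pauses:
            if start <= due < start + length:
                actual = start + length
        sent.append(actual)
        due += interval_s
    return sent


def backup_goes_active(sent, timeout_s):
    """True if the backup ever waits longer than timeout_s between handshakes."""
    last = 0.0
    for t in sent:
        if t - last > timeout_s:
            return True  # from the backup's side, the master looks gone
        last = t
    return False


# A 4 second pause starting at t=4.5 swallows several handshakes in a row.
long_pause = handshake_times(1.0, 10.0, [(4.5, 4.0)])
# A 0.3 second pause only delays one handshake slightly.
short_pause = handshake_times(1.0, 10.0, [(4.9, 0.3)])
```

With an assumed three second timeout, `backup_goes_active(long_pause, 3.0)` is true and `backup_goes_active(short_pause, 3.0)` is false. The master never went away in either case; it just couldn't speak up in time.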