Recently, our IT departement swapped out our network harware in favor of SD-WAN technology. After that change we noticed a degredation in performance across our clients (95% of our clients are Vision, though there is some perspective being used).
To make a long story short, in looking into potential bottlenecks, they found that there was a significant amount of data passing from the Gateway to the SQL server. The new technology now uses 256bit encryption (except on one tunnel which they changed to 128bit and saw an increase in performance).
What they saw was a 3GB request in a single session.
My question, while I understand that a session can consist of multiple transactions, which in turn can consist of multiple queries, 3GB still seems like a huge ask for a query.
Am I just unfamiliar enough with how the JDBC driver works, that what I see as odd is actually expected? Or is there some thing that we have done in our project which is leading to these large requests.
Of note, the large requests seem to be on a 2 min. cycle. I have IT running a Trace on the SQL server to see if I can narrow down, where the query is originating from. Also of note, the return size of this particular request is around 11GB, which also seems very large to me, but if were making a 3GB request, then I can understand the size.
All gateway logs are fine, no extar long running queries are being reported. Performance metrics on the Gateway also look good to me.
Has anyone seen anything like this before. Any information or discussion around this would be appriciated.
Does this mean your production history database connection is over a WAN? Ewww!
Don't get hung up on DB "sessions", as Ignition pools them and re-uses them. Any particular query can run on any of the open sessions. Sessions won't combine multiple queries into a transaction unless you have deliberately invoked Ignition's transaction start/end functions.
What sorts of queries are generally running inside your system? a 3 GB query looks like itâs some kind of âbatching together a bunch of smaller queries into a unioned runprepquery to optimize for minimizing transit hops between the db and the gatewayâ, but this is probably going to require digging into the code
Do you have any trends running in realtime mode? In Perspective if I open a powerchart with 5 pens, 1 day range, and set to realtime then I'll see 20MB/s sustained traffic on our gateway until the trend is closed or switched to historical mode.
Unsure about Vision but that was a large source of traffic I noticed initially and then had to change the trends to use historical instead. Initially I thought it'd be smart enough to only poll the new data but it seems like it queries the entire history every poll.
If you run a traceroute (I assume you are running windows on one of these servers) between the gateway and the SQL, what sort of latency and how many hops are there between the two?
I have seen this sort of thing before where IT sets up an SD-WAN where they have a brand new set of routing tables system wide. This means you find new cases where routing will redirect subnet-subnet traffic via main routers that are on remote systems for odd reasons also they may have installed new firewalls that do the SD-WAN that are slower at routing than the old routers, so the higher traffic and encryption requirements will slow the whole process down.
Definitely worth the traceroute and make sure the path that is being taken is the expected path for the data.
Thereâs an open source program called âWinMTRâ that is great for detecting packet loss on a path also if youâre looking to monitor for packet loss.
I googled âSD-wan 2 min cycleâ and found something about Cisco IOS XE SD-WAN device behind a NAT device rotating through a list of base ports to establish DTLS connections when a connection attempt is unsuccessful. It could be a coincidence that youâre seeing large requests on a 2 minutes cycle but I thought Iâd mention this in case itâs helpful.
Search this documentation for â2 minutesâ to find what Iâm talking about.
Hereâs a copy/paste of the text of the part of the manual I am referring to:
For a tunnel interface (TLOC) on a Cisco IOS XE SD-WAN device behind a NAT device, you can configure the interface to rotate through a pool of preselected OMP port numbers, known as base ports, to establish DTLS connections with other WAN edge devices when a connection attempt is unsuccessful. By default, port hopping is enabled on Cisco IOS XE SD-WAN devices and on all tunnel interfaces on Cisco IOS XE SD-WAN devices.
There are five base ports: 12346, 12366, 12386, 12406, and 12426. These port numbers determine the ports used for connection attempts. The first connection attempt is made on port 12346. If the first connection does not succeed after about 1 minute, port 12366 is tried. After about 2 minutes, port 12386 is tried; after about 5 minutes, port 12406; after about 6 minutes, port 12426 is tried. Then the cycle returns to port 12346.
If you have configured a port offset with the port-offset command, the five base ports are a function of the configured offset. For example, with a port offset of 2, the five base ports are 12348, 12368, 12388, 12408, and 12428. Cycling through these base ports happens in the same way as if you had not configured an offset.
Worthwhile running a trace to see if any obvious route issues jump out. Even if the route looks good...I would see if 'they' are onboard with a temporary test to place the SQL server and Ignition server on the same subnet...invasive as it might be.
Even if performance can be restored to clients, 3GB (in a short? duration) seems excessive (even with asynchronous routing). Might be worthwhile taking advantage of the poor performance to pinpoint & fix any views which have long-running (large) queries configured by default.