Gateway Exception: Connection refused

ezra.geniza · February 4, 2021, 7:53pm

I increased the number of clients on our network from 100 to 120, and all of a sudden all the clients, including designer clients, started getting a ‘Gateway Exception: Connection refused’ error. Is there a maximum number of clients allowed? And if there is, is there a way to increase it?

I actually had this issue before, and the solution I had was to remove some of the clients, which seemed to work. Unfortunately, that’s not an option anymore. Any help would be appreciated.

pturmel · February 4, 2021, 8:52pm

Platform? Could be a resource limit in the OS for the service.

ezra.geniza · February 4, 2021, 9:20pm

The OS is Windows Server 2016. Which resources should I pay attention to? I increased available RAM in ignition-config, but it doesn’t seem to help.

pturmel · February 4, 2021, 10:14pm

I don’t do Windows much. If you’d said Linux, I’d have pointed you towards “ulimit” settings.

kcollins1 · February 4, 2021, 10:58pm

Sounds like it could be socket exhaustion on the server side… Can you run a netstat -an from Command Prompt? I’d be looking for clues here, such as a bunch of old connections in TIME_WAIT state, gobbling up all of your available ports.
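For anyone who wants to tally the states rather than eyeball the output, here is a minimal Python sketch that counts TCP rows by state. It assumes the output of netstat -an has been saved to a string, and that the state is the last column of each TCP row (true for the usual Windows and Linux layouts). The function name count_tcp_states is made up for this example.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections by state in saved `netstat -an` output.

    Assumes the state (ESTABLISHED, TIME_WAIT, ...) is the last column
    of every TCP row. Header rows and UDP rows are skipped.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # A TCP row has at least: protocol, local address, foreign address, state.
        if len(fields) >= 4 and fields[0].upper().startswith("TCP"):
            states[fields[-1]] += 1
    return states
```

Feeding it the text captured with something like subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout and printing states.most_common() gives a quick view of whether TIME_WAIT dominates.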

ezra.geniza · February 5, 2021, 3:49pm

Good call. I do see some TIME_WAIT connections. I will see if I can get rid of them.

I also notice that some ESTABLISHED connections connect to the same client. Would this be due to tags and queries? Is there a way to reduce these to (ideally) one connection each?
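To put a number on how many ESTABLISHED connections each client holds, a rough Python sketch along these lines could group the netstat -an rows by remote host. It assumes the usual column order (local address, foreign address, state as the last three columns). The function name and the optional local_port filter are made up for this example, and the port to filter on depends on how the gateway is configured.

```python
from collections import Counter
from typing import Optional


def established_per_remote(netstat_output: str,
                           local_port: Optional[int] = None) -> Counter:
    """Count ESTABLISHED TCP connections per remote host.

    If local_port is given, only rows whose local address ends in that
    port are counted, which filters out the server's own outbound
    connections (for example to a database).
    """
    per_host = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].upper().startswith("TCP"):
            continue
        if fields[-1] != "ESTABLISHED":
            continue
        local, foreign = fields[-3], fields[-2]
        if local_port is not None and local.rpartition(":")[2] != str(local_port):
            continue
        # Strip the port so connections from one host are grouped together.
        host = foreign.rpartition(":")[0]
        per_host[host] += 1
    return per_host
```

Calling established_per_remote(text, local_port=8088) would be the usage if the gateway listens on 8088, which is an assumption here; hosts with a high count are the ones holding several connections at once.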

pturmel · February 5, 2021, 3:55pm

Are your clients in the same subnet as the server? Any chance there’s a network device in the routing path that has a resource limit? Like a wimpy NAT connection tracking table, perhaps.

kcollins1 · February 5, 2021, 4:58pm

There is a way to limit concurrent connections between the client and the gateway: see the Connection Concurrency option in Project Properties->Vision->Timing.

Limiting this will constrain the number of connections used by the client to communicate with the gateway.

Herbie · February 5, 2021, 7:32pm

I believe there is a hard limit set on the gateway defined in the gateway.xml file. I think this parameter is catapult.acceptCount (default is 100), and when reached will immediately refuse connections coming from additional clients. I’m not sure what the max should be set to, but you might try increasing this number and see if this resolves your problem.

ezra.geniza · February 5, 2021, 7:50pm

Yes, all clients are in the same subnet as the server. If it’s an NAT issue, that’s pretty bad for me since I can’t change that right now.
The good thing is that limiting the concurrent connections seemed to help out. At least I can actually use designer now. I will also try increasing accept count in gateway.xml to see if it helps. Thanks a lot.

Kevin.Herron · February 5, 2021, 7:53pm

When you look at the threads page in the gateway status are there pretty much always 300 HTTP threads active? Can you upload a thread dump for us to look at?

ezra.geniza · February 8, 2021, 5:20pm

Hi Kevin,

Unfortunately, I cannot upload just yet since this user account is still new. However, I can say that most of the threads are in WAITING or TIMED_WAITING states. Not sure what can be done with it. I’ll include a couple of lines below:
Daemon Thread [catapult-filemonitor] id=19, (TIMED_WAITING)
  waiting for: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@a55304e
  sun.misc.Unsafe.park(Native Method)
  java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
  java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
  java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
  java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
  java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
  java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  java.lang.Thread.run(Unknown Source)

Daemon Thread [catapult-filemonitor-handler] id=34, (WAITING)
  waiting for: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@43515b5f
  sun.misc.Unsafe.park(Native Method)
  java.util.concurrent.locks.LockSupport.park(Unknown Source)
  java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
  java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
  java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
  java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  java.lang.Thread.run(Unknown Source)

Daemon Thread [com.jniwrapper.NativeResourceCollector] id=88, (TIMED_WAITING)
  waiting for: java.lang.ref.ReferenceQueue$Lock@5cbf0449
  java.lang.Object.wait(Native Method)
  java.lang.ref.ReferenceQueue.remove(Unknown Source)
  com.jniwrapper.a.run(SourceFile:158)

Daemon Thread [cron4j::scheduler[gateway-shared-exec-engine]::timer[b0b4bc633b08035c67762d100000017772b4502f4f76e13a]] id=221, (TIMED_WAITING)
  java.lang.Thread.sleep(Native Method)
  it.sauronsoftware.cron4j.TimerThread.safeSleep(Unknown Source)
  it.sauronsoftware.cron4j.TimerThread.run(Unknown Source)

Thread [cron4j::scheduler[reporting]::timer[b0b4bc633b08035c67762d100000017772b45a1361604e62]] id=224, (TIMED_WAITING)
  java.lang.Thread.sleep(Native Method)
  it.sauronsoftware.cron4j.TimerThread.safeSleep(Unknown Source)
  it.sauronsoftware.cron4j.TimerThread.run(Unknown Source)

Thread [cron4j::scheduler[taskmanager]::timer[b0b4bc633b08035c67762d100000017772b451d32d4d2483]] id=223, (TIMED_WAITING)
  java.lang.Thread.sleep(Native Method)
  it.sauronsoftware.cron4j.TimerThread.safeSleep(Unknown Source)
  it.sauronsoftware.cron4j.TimerThread.run(Unknown Source)
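Since the full dump can't be attached, a per-state summary may be easier to share. Below is a minimal Python sketch that counts threads by state. It assumes the dump uses the same header shape as the excerpt above, "Thread [name] id=N, (STATE)", with an optional "Daemon" prefix. The function name and the name_contains filter are made up for this example, and no particular thread name is assumed for the HTTP threads.

```python
import re
from collections import Counter

# Matches headers such as:
#   Daemon Thread [catapult-filemonitor] id=19, (TIMED_WAITING)
# Stack frames like java.lang.Thread.run(...) do not match because they
# are not followed by " [name] id=".
THREAD_HEADER = re.compile(
    r"\bThread\s+\[(?P<name>.+?)\]\s+id=(?P<id>\d+),\s+\((?P<state>[A-Z_]+)\)"
)


def summarize_states(dump_text: str, name_contains: str = "") -> Counter:
    """Count threads by state, optionally only those whose name
    contains the given substring."""
    return Counter(
        match.group("state")
        for match in THREAD_HEADER.finditer(dump_text)
        if name_contains in match.group("name")
    )
```

Passing a substring that identifies the HTTP worker threads in the actual dump as name_contains would show how many of them are busy versus parked, which is what the earlier question about the threads page was getting at.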

Kevin.Herron · February 8, 2021, 5:23pm

You should probably get in touch with support so they can take a look and receive your full thread dump. This is a very small portion of the overall threads and not any of the relevant ones.