[IGN-6789] Errors On New Install Hub and Spoke Edge Gateways (UascServerHelloHandler)

Background: I'm proving out a hub-and-spoke system on Ignition 8.1, using Ignition Standard for the hub and Ignition Edge for the spokes. The hub is configured with 3 incoming gateway network connections, one from each of the spokes.

I have only done minor configuration on each of the gateways, and none of them has any visualization. On one of the gateways I haven't done anything except make the gateway network connection.

Setup:

  • Containerized setup using Docker Desktop 4.6.0
  • The hub, 3 spokes, and a Postgres DB wrapped up in one pod

The Problem: On each of the spoke gateways (not the hub), errors show up in the log about every 30 seconds from the "UascServerHelloHandler" logger, indicating an out-of-memory error on Direct Buffer Memory.

Error message below (the number after "127.0.0.1:" always increases with each error; in this case it's 57748):

[remote=/127.0.0.1:57748] Exception caught; sent ErrorMessage{error=StatusCode{name=Bad_UnexpectedError, value=0x80010000, quality=bad}, reason=Direct buffer memory}

Details:

at java.base/java.nio.Bits.reserveMemory(Unknown Source)
at java.base/java.nio.DirectByteBuffer.<init>(Unknown Source)
at java.base/java.nio.ByteBuffer.allocateDirect(Unknown Source)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:204)
at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:174)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:128)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:378)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:139)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:150)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Unknown Source)

Strange. Is the loopback OPC UA connection faulted? Do these errors start happening immediately on restart or after some time? Can you upload the full logs for one of those spokes somewhere?

I want to think this is somehow related to Docker but I don’t have the skills to diagnose or troubleshoot that. @kcollins1 might have an idea?

edit: you might first just try increasing the memory available to your containers…
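For example, a minimal sketch of what that could look like in a Compose service definition, assuming the official image's GATEWAY_INIT_MEMORY / GATEWAY_MAX_MEMORY environment variables (values in MB) and a hypothetical service name:

services:
  my-edge-gateway:
    image: inductiveautomation/ignition:8.1.13
    environment:
      # hypothetical values, in MB; the JVM's default direct buffer ceiling
      # follows the max heap when -XX:MaxDirectMemorySize isn't set explicitly
      - GATEWAY_INIT_MEMORY=512
      - GATEWAY_MAX_MEMORY=1024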

Agreed on the need for more complete logs. You mention having these all in a single “pod”, which makes me infer that you’re deploying to Kubernetes (via the K8s distribution facilitated by Docker Desktop)?

If the containers are sharing a pod, that means they’re sharing network resources, and you’re probably running into conflicts with the local OPC-UA server binding (to port 62541 on the loopback interface). This would be akin to running multiple instances of Ignition within a single traditional Docker container (or on one Linux machine), with some of the same complications. Any reason not to have them in different pods?

Either way, I think you’ll need to give us more insight into your deployment definition, because it can definitely impact how things work.

Sorry, maybe wrong terminology. I am running several services through one Docker Compose file (image: inductiveautomation/ignition:8.1.13). I’m not using Kubernetes. They do all share the same network, though.

I have a mount to /usr/local/bin/ignition/data, but not to logs. I tried to get the logs using docker cp, but even running CMD with admin privileges, I still only got a symlink to the wrapper logs… The .idb file I downloaded from the gateway logs page is too large to attach (13 MB).

The OPC-UA binding is indeed faulted, so that could be the source of the issue. I was hoping to keep them all in the same… I don’t know the right terminology… container with multiple containers inside(??) for organization’s sake. But if it doesn’t work, there’s no real reason not to separate them.

Oh, also wondering why the OPC UA server would be faulted on the 3 Edge gateways, but not on the hub?

Can you share your Compose definition?

file: docker-compose.yml

version: "3.1"

services:
  3pdc-dev1-hub:
    image: inductiveautomation/ignition:8.1.13
    hostname: 3pdc-dev1-hub
    container_name: 3pdc-dev1-hub
    ports:
      - "8300:8088"
    volumes:
      - hub_data:/usr/local/bin/ignition/data
    env_file:
      - ./standard.env

  3pdc-dev1-spoke1:
    image: inductiveautomation/ignition:8.1.13 
    hostname: 3pdc-dev1-spoke1
    container_name: 3pdc-dev1-spoke1
    ports:
      - "8301:8088"
    volumes:
      - spoke1_data:/usr/local/bin/ignition/data
    env_file:
      - ./edge.env
    environment:
      - GATEWAY_NETWORK_0_HOST=3pdc-dev1-hub  

  3pdc-dev1-spoke2:
    image: inductiveautomation/ignition:8.1.13 
    hostname: 3pdc-dev1-spoke2
    container_name: 3pdc-dev1-spoke2
    ports:
      - "8302:8088"
    volumes:
      - spoke2_data:/usr/local/bin/ignition/data
    env_file:
      - ./edge.env
    environment:
      - GATEWAY_NETWORK_0_HOST=3pdc-dev1-hub
      
  3pdc-dev1-spoke3:
    image: inductiveautomation/ignition:8.1.13 
    hostname: 3pdc-dev1-spoke3
    container_name: 3pdc-dev1-spoke3
    ports:
      - "8303:8088"
    volumes:
      - spoke3_data:/usr/local/bin/ignition/data
    env_file:
      - ./edge.env
    environment:
      - GATEWAY_NETWORK_0_HOST=3pdc-dev1-hub

  3pdc-dev1-db:
    image: postgres:14.2
    hostname: 3pdc-dev1-db
    container_name: 3pdc-dev1-db
    ports:
      - "5300:5432"
    volumes:
      - db_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=passwordHidden

volumes:
  hub_data:
    driver: local
    driver_opts:
      type: none
      device: "C:/Workspace/Docker-Compose/3pdc-gwn-dev1/ignition/hub"
      o: bind
  spoke1_data:
    driver: local
    driver_opts:
      type: none
      device: "C:/Workspace/Docker-Compose/3pdc-gwn-dev1/ignition/spoke1"
      o: bind
  spoke2_data:
    driver: local
    driver_opts:
      type: none
      device: "C:/Workspace/Docker-Compose/3pdc-gwn-dev1/ignition/spoke2"
      o: bind
  spoke3_data:
    driver: local
    driver_opts:
      type: none
      device: "C:/Workspace/Docker-Compose/3pdc-gwn-dev1/ignition/spoke3"
      o: bind
  db_data:
    driver: local
    driver_opts:
      type: none
      device: "C:/Workspace/Docker-Compose/3pdc-gwn-dev1/db"
      o: bind

file: standard.env

ACCEPT_IGNITION_EULA=Y
GATEWAY_ADMIN_USERNAME=user
GATEWAY_ADMIN_PASSWORD=passwordHidden
IGNITION_EDITION=standard

file: edge.env

ACCEPT_IGNITION_EULA=Y
GATEWAY_ADMIN_USERNAME=user
GATEWAY_ADMIN_PASSWORD=passwordHidden
IGNITION_EDITION=edge

No problem, I made an assumption based on "pod". Had you used "stack", I might not have steered in that direction. Under Docker and with Docker Compose, each of these "services" is going to be brought up as an individual container on a common Docker "network". However, unlike containers in a pod in K8s, each container will have its own IP address, so you shouldn't be in the same trouble I was describing before.

I definitely don't recommend using bind mounts under Docker Desktop, as they typically perform horribly compared to named volumes. However, we'll table that for now--the data volumes should work the way you have them defined [with the long syntax ensuring that they're preloaded with pre-existing image data at the target location].

With respect to the logs, you'd retrieve those with something like this (targeting your hub service):

docker compose logs 3pdc-dev1-hub > hub-logs.txt

This will save a hub-logs.txt into the current directory on your host. Since Docker collects the container's stdout in its own log store, we don't duplicate that storage inside the container; hence the symlink you saw, which redirects the wrapper log to stdout.
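The same thing works per service if you want each spoke's log in its own file (the --timestamps flag helps line events up across gateways):

docker compose logs --timestamps 3pdc-dev1-spoke1 > spoke1-logs.txt
docker compose logs --timestamps 3pdc-dev1-spoke2 > spoke2-logs.txt
docker compose logs --timestamps 3pdc-dev1-spoke3 > spoke3-logs.txt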

What you're doing to manage these is fine, and you shouldn't be seeing bind failures with it. However, for whatever reason, it does seem like something went wrong here. I'd recommend the following sequence of steps:

  1. Shut down everything and clear out the volume definitions with a docker compose down -v. (With the way you've set up the bind mounts, this won't actually clear out the local files in the ignition/xxx and db sub-folders you've defined.)
  2. Remove the ignition/xxx and db sub-folders, then recreate them all as empty folders.
  3. Restart Docker Desktop.
  4. Bring things back up with docker compose up -d (rough command sketch below).
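Roughly, from the folder containing docker-compose.yml (handle the folder cleanup however you prefer on the Windows side):

docker compose down -v
# remove and recreate the ignition/hub, ignition/spoke1..3, and db folders,
# then restart Docker Desktop before continuing
docker compose up -d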

I was able to bring your Compose solution in and get everything up and running without issue, so I'm hoping this sequence works for you and that there was just some lingering state related to those volumes causing trouble.

Okay, back to those volumes… I'd really recommend just going to named volumes.
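For reference, a minimal sketch of the named-volume form: drop the driver_opts bind configuration and let Docker manage the volume data; the service-level volume mappings stay exactly as they are.

volumes:
  hub_data:
  spoke1_data:
  spoke2_data:
  spoke3_data:
  db_data: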

If you really need to peek at the files in a given container (and its associated volume), something like VSCode and its Docker extension makes this really easy, and Docker Dashboard even has some improved volume interactions nowadays.

Hopefully some of this helps you get going!

I have the exact same UascServerHelloHandler error happening, but no Docker involved in my setup. I’m running 8.1.13 Edge (aarch64) on a Raspberry Pi 4 (4 GB) with Ubuntu 20.04.

The gateway will run fine for quite a while, sometimes days, but eventually the OPC UA server faults with this error. Restarting the OPC-UA module doesn’t fix it; only a gateway restart does. I’ve tried several Pis with the same result.

It would be helpful to get some more complete logs (specifically, I'm looking for other causal events). I’ll spool up an instance on a spare Raspberry Pi 4 (also 4 GB, also Ubuntu 20.04) that I’ve got here; this one also runs off an SD card. The only common link I see so far is that both of these setups might exhibit some slow I/O. That's only a surface-level observation though, no conclusions yet.

If you are able to send some more comprehensive logs, I’d be happy to take a deeper look.

kcollins,

Were you able to find a solution to the problem kmorning mentioned with the Raspberry Pi 4 above?

I wasn't able to reproduce that scenario in my admittedly limited amount of testing. Do you have any logs that you could share?

Those were copied from the Status > Connections > OPC Connections page when the Ignition OPC UA Server is showing a "Fault" Status. I've also uploaded the wrapper log.

java.lang.Exception: session inactive: id=NodeId{ns=1, id=Session:b26d8473-d804-4415-b21e-1c709344944a} name=ignition[hwy15-pi-ignition]_Ignition OPC UA Server_1666795822330
at com.inductiveautomation.ignition.gateway.opcua.client.connection.OpcUaConnection$MiloSessionActivityListener.onSessionInactive(OpcUaConnection.kt:499)
at org.eclipse.milo.opcua.sdk.client.session.SessionFsmFactory.lambda$null$39(SessionFsmFactory.java:577)
at java.base/java.util.concurrent.CopyOnWriteArrayList.forEach(Unknown Source)
at org.eclipse.milo.opcua.sdk.client.session.SessionFsmFactory.lambda$configureActiveState$40(SessionFsmFactory.java:577)
at com.digitalpetri.strictmachine.dsl.ActionBuilder$PredicatedTransitionAction.execute(ActionBuilder.java:119)
at com.digitalpetri.strictmachine.StrictMachine$PollAndEvaluate.lambda$run$0(StrictMachine.java:242)
at java.base/java.util.ArrayList.forEach(Unknown Source)
at com.digitalpetri.strictmachine.StrictMachine$PollAndEvaluate.run(StrictMachine.java:227)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)

8.1.15 (b2022030114)
Azul Systems, Inc. 11.0.14.1

wrapper.log (6.2 MB)

I don't see that particular entry within the wrapper log; perhaps it was from one of the older wrapper log files. Either way, I was able to glean that you're using Edge edition and are polling some Modbus devices and shipping some data to MQTT. I'll go ahead and set up a longer-running test bench (on my Pi 4) with some of these things in play to see if I can observe any memory issues and similar errors.

Yes, that is correct. I'm polling two Modbus devices connected to a Raspberry Pi 4, which is running an Ignition Edge gateway. The gateway has the MQTT Transmission module installed, and it is publishing to an MQTT broker. I will mention that this OPC UA server fault does occur with the MQTT Transmission module disabled as well.

Well, I've set up a test environment polling some registers from a Modbus simulator and publishing up to a test MQTT broker… I'll let you know if I see any failures!

Polling two modbus devices connected to a Raspberry Pi 4

Can I ask how you connected them? I'm looking at adding GPS (Modbus) to my Pis running Edge, but all the solutions I've found are some sort of HAT. I have my Pis in cases for the cooling fan etc., so a HAT wouldn't fit.

So it looks like I was able to reproduce this, although with some interesting base circumstances. I also used 8.1.15 and targeted Edge as you had. For my initial test, I just reset the trial every time it expired. Resetting the trial timer triggers a restart of the OPC-UA module (and its dependents); here is a snippet from your log file, @Gerry_Dion, in which you seem to be doing the same thing:

INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.LicenseManager              ] [11:18:10]: Trial time reset.  Time remaining = 7200. route-group=status, route-path=/trial
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [Project                       ] [11:18:10]: Restarting gateway scripts... project=Edge
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Restarting module "com.inductiveautomation.opcua" module-name=OPC-UA
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.ablegacy" module-name=OPC-UA
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.ablegacy' v6.1.15 (b2022030113)... module-name=Allen-Bradley Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.ablegacy" completed in 1 ms module-name=Allen-Bradley Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.modbus" 
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.modbus' v7.1.15 (b2022030113)... module-name=Modbus Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.modbus" completed in 2 ms module-name=Modbus Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.siemens" 
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.siemens' v6.1.15 (b2022030113)... module-name=Siemens Drivers
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.siemens" completed in 2 ms module-name=Siemens Drivers
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.bacnet" 
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.bacnet' v2.1.15 (b2022030113)... module-name=BACnet Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.bacnet" completed in 3 ms module-name=BACnet Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.logix" 
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.logix' v5.1.15 (b2022030113)... module-name=Logix Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.logix" completed in 6 ms module-name=Logix Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.dnp3" 
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.dnp3' v4.1.15 (b2022030113)... module-name=DNP3 Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.dnp3" completed in 2 ms module-name=DNP3 Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.omron" 
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.omron' v4.1.15 (b2022030113)... module-name=Omron Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.omron" completed in 0 ms module-name=Omron Driver
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Stopping dependent module "com.inductiveautomation.opcua.drivers.tcpudp" 
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua.drivers.tcpudp' v6.1.15 (b2022030113)... module-name=UDP and TCP Drivers
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutdown of module "com.inductiveautomation.opcua.drivers.tcpudp" completed in 1 ms module-name=UDP and TCP Drivers
INFO   | jvm 1    | 2022/10/04 07:18:10 | I [g.ModuleManager               ] [11:18:10]: Shutting down module 'com.inductiveautomation.opcua' v9.1.15 (b2022030113)... module-name=OPC-UA

With each of these events, I saw an increase of around 50 or so MB of non-heap memory. Heap memory was pretty stable. It looks like it is probably related to Direct Buffer Memory allocated by Netty (within the OPC-UA Milo stack).

Thankfully, when I ran the same setup with a license (I ended up using Maker for this test), and as a result no periodic module restarts after trial reset, things were very stable and ran just fine.

Also, FWIW, under 8.1.21 (also still under Docker on a RaspPi 4), performing manual restarts of the OPC-UA module seemed to produce a similar increase in non-heap memory usage over time, but the threshold for being denied additional Direct Buffer memory seems to be different--things continued to function as expected. The memory usage still climbs incrementally upon trial resets (and thus OPC-UA module restarts), but it doesn't seem to hit premature OOM errors like before. Perhaps this is due to some of the Milo/Netty updates along the way, or the newer point release of JDK 11? I haven't researched it fully.
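If anyone needs a stopgap on a standard (non-Docker) install in the meantime, one option, untested against this particular leak, would be raising the direct memory ceiling with a wrapper argument in data/ignition.conf; the index and size below are hypothetical, so use the next unused wrapper.java.additional slot and size it for your device:

# hypothetical addition to data/ignition.conf; requires a gateway restart to take effect
wrapper.java.additional.3=-XX:MaxDirectMemorySize=512m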

Restarting the OPC UA module has been known to leak in the past. I'm not sure it's ever been fixed.

Modules being restarted when the trial resets is surprising behavior to me. Is that only with Edge?

We're using the USB ports on the front of the Raspberry Pi.