Ignition 8.1.3
I am getting WebSocket connection errors quite often. It does not look like I am having any problems other than the logs being spammed a bit.
Just a note: I am using the hostnames kttruck1 and goodcloud1 within the Docker overlay network, as can be seen below.
Are these containers running stand-alone or through an orchestration layer? It looks like some of the errors are indicating name resolution problems. If a container is removed and recreated, there would be a temporary failure in name resolution on the overlay network.
Also, are these containers on a custom overlay network (i.e. not the built-in bridge network)? I imagine they are, and in that case the containers should be using Docker’s embedded DNS (you should see /etc/resolv.conf in the container pointing to 127.0.0.11)… I suppose it is possible that there is some periodic hiccup from that causing the Temporary failure in name resolution.
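For reference, a quick way to verify that from the host (substituting your actual container name for kttruck1):

```sh
# Check which resolver the container is using; on a custom network
# this should point at Docker's embedded DNS server.
docker exec kttruck1 cat /etc/resolv.conf
# expected: nameserver 127.0.0.11
```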
EDIT: a quick test on my end of stopping the destination container for an outgoing GAN connection shows that you’d get a different error, Name or service not known, in that case.
Yes, as you mentioned, these errors show up every 30-60 minutes (while the containers are not stopped or anything like that).
For example, this morning when I logged on, I had 5 instances of java.net.UnknownHostException: kttruck1: Temporary failure in name resolution within the last 2 hours.
This is just for one outgoing connection; I will connect 3 more very soon, which will give me very messy logs.
Yes, I am using a custom docker swarm overlay network.
Are the services deployed across multiple machines (i.e. is it a multi-node swarm cluster or just a single node)? What version of Docker Engine are you running (perhaps the output of docker system info)?
I’d like to try a simple reproduction… I’m not aware of any particular issues with the embedded Docker DNS server… You could probably work around this issue by locking in your IPs on the overlay networks and using the extra_hosts: configuration option. I think this is more of a Docker / Docker Swarm issue, though; Ignition may just be the victim of periodic failed DNS lookups.
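A rough sketch of that workaround (the network name, subnet, and addresses below are made up for illustration):

```yaml
# Hypothetical compose excerpt: pin the container's IP on the overlay
# network and short-circuit DNS for the GAN peers via /etc/hosts entries.
services:
  gateway:
    image: inductiveautomation/ignition:8.1.3
    networks:
      gan-overlay:
        ipv4_address: 10.0.9.10   # assumed static address on the overlay
    extra_hosts:
      - "kttruck1:10.0.9.11"      # assumed peer addresses
      - "goodcloud1:10.0.9.12"

networks:
  gan-overlay:
    external: true  # an attachable overlay created ahead of time
```

With entries in extra_hosts:, lookups for those names resolve from /etc/hosts and never touch the embedded DNS server.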
I’ll do some testing, but I’m curious: is there any particular reason you’re not actually using Docker Swarm to deploy these (e.g. docker stack deploy ...)? (Obviously you’d need some more work in the compose files, namely the deploy: option.)
It was my understanding that using docker stack deploy would create a service on an arbitrary worker node. And if this is true, then how can I alias directly to the truck1 Ignition installation, which is supposed to be located on the “truck1” machine?
Our Ignition Edge “worker” nodes are similar but not identical, and service personnel demand Ignition Designer access from time to time, so each installation must be located on the correct machine.
The (initial) reason we use Swarm networking here is to utilize Docker as well as overlay networking, which fits with the rest of our IoT topology/scheme in other projects.
I guess a very efficient way of updating all swarm nodes would be to make Ignition changes centrally against a simulator, push them to a git repo, and then automatically pull them to the correct worker nodes. But this level of programming/scripting might be difficult for me and our company to maintain at the moment.
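Something like this is what I had in mind, very roughly (repo layout, branch name, and paths are all made up):

```sh
#!/bin/sh
# Hypothetical update script on a worker node (e.g. run from cron).
# Assumes the checkout is bind-mounted into the gateway's projects
# directory, and that each node tracks its own branch.
set -e
cd /opt/ignition-projects
git fetch origin
git pull --ff-only origin truck1
```

Since Ignition 8.1 periodically rescans the projects directory on disk, pulled changes should get picked up without a restart, if I understand it correctly.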
Docker Swarm is a new “platform” for us, and we are happy for any input you might have!
@kcollins1 Do you think it would be possible to look at your swarm compose files, and maybe the commands you use to start the containers? They can be started without Swarm being online, right? (My edge computers are sometimes in areas without network, which still requires the HMI to be visible.)
If you want to be able to manage your container lifecycle without Swarm due to network instability, perhaps what you’re doing is the best method; I’d just never thought of trying to use Swarm only for the setup/maintenance of overlay networks. There certainly isn’t anything stopping you from adding overlay networks with --attachable to let stand-alone containers connect and function.

That said, you could still add the option to deploy services via Swarm through the use of node labels along with the deploy: configuration option and placement constraints, as sketched below. The variable I’m unsure of is what level of control you’d have without a functioning Swarm manager node; after all, that is the point of the “orchestration”… Having those aspects in your compose files shouldn’t affect stand-alone Docker Compose use, though (deploy: is ignored by docker-compose, for example), and it might make for easier multi-node simulation of your system in a dev cluster.
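To sketch what I mean (the network name, service name, and label here are assumptions on my part):

```sh
# One-time setup on a manager node: create an attachable overlay so
# stand-alone containers can join, and label the target machine.
docker network create --driver overlay --attachable gan-overlay
docker node update --label-add site=truck1 <node-name>
```

```yaml
# Hypothetical compose/stack excerpt: `docker stack deploy` honors the
# deploy: section, while stand-alone docker-compose simply ignores it,
# so the same file can serve both workflows.
services:
  truck1-gateway:
    image: inductiveautomation/ignition:8.1.3
    networks:
      - gan-overlay
    deploy:
      placement:
        constraints:
          - node.labels.site == truck1   # pin the service to that machine

networks:
  gan-overlay:
    external: true
```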