Ignition service on primary server hangs

I am running redundant Ignition servers, and the Ignition service on my primary server hangs.

When I check the status of the service on the primary server from a command line (service ignition status), it reports that the service is running. However, I cannot reach the gateway webpage on the primary, and the gateway webpage on the backup does not show the primary as running. The backup is active and failover works.

I am using Ignition 7.5, running on SuSE Enterprise Linux 11. The Linux servers run as VMs under ESX 5 and are located on separate physical servers.

This has been a very stable setup that we have been running for about a year. I have seen the service hang twice in the last week.

After the first time I went through the wrapper.log and tried to clear up any errors that I could.

This time the log contains this error:

INFO | jvm 1 | 2012/07/03 07:37:16 | Jul 3, 2012 7:37:16 AM org.apache.tomcat.util.net.JIoEndpoint$Acceptor run
INFO | jvm 1 | 2012/07/03 07:37:16 | SEVERE: Socket accept failed
INFO | jvm 1 | 2012/07/03 07:37:16 | java.net.SocketException: Too many open files
wrapper.zip (13.1 KB)
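
For reference, "Too many open files" means the process has reached its limit on open file descriptors, and every open socket counts toward that limit. Below is a minimal sketch for checking the limit and counting descriptors. It assumes a Linux host with /proc and a regular CPython interpreter on the server (Ignition's Jython may not provide the resource module). The function names fd_limit and open_fd_count are made up for illustration.

```python
import os
import resource


def fd_limit():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid):
    """Count a process's open descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


if __name__ == "__main__":
    soft, hard = fd_limit()
    print("soft limit: %s, hard limit: %s" % (soft, hard))
    print("open descriptors in this process: %d" % open_fd_count(os.getpid()))
```

The limits printed here belong to the Python process itself and may differ from the ones applied to the Ignition service. Passing another process's PID to open_fd_count generally requires running as root or as the user that owns that process.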

On the server with the problem, find the ignition process ID. In a terminal, do:

ps aux | grep ignition

Take the process id, run:

lsof -p 22118 > ~/Desktop/ignition_files.txt

Substitute your actual process ID where I used ‘22118’. Then upload or send in the resulting file (ignition_files.txt on your desktop).

Rebooting the server should get you running again.

Yes, when I stopped the ignition service and started it again the redundancy was back to normal. So when I ran lsof to get this file everything was running normally.
ignition_files.txt (1.3 KB)

Well, if it happens again, try running the lsof command again before you restart it. That might help track down what’s going on.

Ok, thanks Kevin.

I have this problem again, with the same error. I ran the lsof command, but the output looks the same to me. I have not restarted the server or the Ignition service yet; it is still in a hung state. Is there anything else I can try to narrow the problem down while it is still in this state?
wrapper.zip (60.3 KB)
ignition_files2.txt (1.3 KB)

I think that’s the wrong process - I need the PID of the java process running Ignition. I think you’ve got the process running the service wrapper.

Ok. I think I have a better idea of what the problem is now. When I ran lsof for the Java process, it showed TCP connections that are not getting closed:

java 17222 root 1212u IPv6 31821398 0t0 TCP x.y.net:41581->mail.y.com:imap (CLOSE_WAIT)
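
CLOSE_WAIT means the remote end has closed the connection but the local application has not yet closed its socket, so each such line is a descriptor the Java process is still holding. A rough sketch for tallying the states in saved lsof output follows. The function name count_tcp_states is made up, and only the first sample line comes from the real output above; the other three are invented for illustration.

```python
import re

# First line is from the actual lsof output; the remaining lines are hypothetical.
SAMPLE = """\
java 17222 root 1212u IPv6 31821398 0t0 TCP x.y.net:41581->mail.y.com:imap (CLOSE_WAIT)
java 17222 root 1213u IPv6 31821399 0t0 TCP x.y.net:41582->mail.y.com:imap (CLOSE_WAIT)
java 17222 root 45u IPv6 31800001 0t0 TCP *:8088 (LISTEN)
java 17222 root 3r REG 8,1 1024 12345 /tmp/example.log
"""


def count_tcp_states(lsof_text):
    """Count TCP connection states such as CLOSE_WAIT in lsof output."""
    counts = {}
    for line in lsof_text.splitlines():
        if " TCP " not in line:
            continue  # skip regular files, pipes, and other non-TCP entries
        match = re.search(r"\(([A-Z_0-9]+)\)\s*$", line)
        state = match.group(1) if match else "UNKNOWN"
        counts[state] = counts.get(state, 0) + 1
    return counts


if __name__ == "__main__":
    for state, count in sorted(count_tcp_states(SAMPLE).items()):
        print("%s: %d" % (state, count))
```

Running this against lsof output captured a few minutes apart would show whether the CLOSE_WAIT count keeps growing.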

I had written code to use a modified version of the python imap library to send alarm notifications and then hit up a mailbox for the reply to ack the alarm.

I am guessing that there is something different with the newer version of Python that is not agreeing with the code I am using.

It’s my understanding that the imap library is now included with the Python files that come with Ignition, so I will see if I can use that.

Included, but untested. :smiley:

I changed my script so that it references the Jython imap library that now comes with Ignition. I tested the script in the script playground and it is working. However, I still have problems with the sockets not closing in a timely manner when I make a connection to the mail server. I was wondering if anyone has used this imap method before. I have attached the script.

On a side note, I have noticed that Ignition 7.5 does not play nice with older Java versions (1.6 update 26) on our server. CPU utilization was up to 75% with this combination. Once we updated Java to 1.7, everything was happy and running at a much lower CPU utilization.
alarm notification.py (13.6 KB)

What is the printed message after you call M.logout()? Are you successfully getting the BYE response?

<bound method IMAP4.logout of <imaplib.IMAP4 instance at 0x687>>

I just noticed that I do not have () after close and logout. When I try a condensed version of that script from a command line with Jython, I get the output above instead of a BYE response. It should be M.close() and M.logout().
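
To illustrate the difference: without the parentheses, Python only evaluates the name to a bound method object and never calls it, so the connection stays open. Here is a small sketch that runs without a mail server. FakeConnection and close_session are hypothetical names; the fake object only mimics the (type, data) style of reply that imaplib returns and is not the real library.

```python
class FakeConnection(object):
    """Hypothetical stand-in for an imaplib.IMAP4 object (no network needed)."""

    def __init__(self):
        self.closed = False
        self.logged_out = False

    def close(self):
        self.closed = True
        return ("OK", ["CLOSE completed"])

    def logout(self):
        self.logged_out = True
        return ("BYE", ["logging out"])


def close_session(conn):
    """Close the selected mailbox, then log out even if close() raises."""
    try:
        close_result = conn.close()
    finally:
        logout_result = conn.logout()
    return close_result, logout_result


if __name__ == "__main__":
    M = FakeConnection()
    M.logout                      # no parentheses: nothing is called
    print(M.logged_out)           # still False at this point
    print(close_session(M))       # both methods are actually called here
    print(M.closed, M.logged_out)
```

Putting logout() in a finally block is one way to make sure the socket gets released even when an earlier step in the script fails.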

So did calling the functions correctly fix the issue?

The close and logout calls return the proper responses now (OK and BYE). I still get CLOSE_WAIT for a few of the connections to the mail server, but nothing like it was before.

Ok, good to know. Glad it seems to be working for you now. Let us know if you start seeing more connection issues again.