Alerting emails stop unexpectedly

Duffanator · August 8, 2012, 10:37pm

Hey guys,

Last week all of my email alerts stopped getting sent out for some reason. I checked what I know to check and couldn’t find anything obvious. I restarted the gateway and the e-mails started working again for a few days, and then stopped again. I’m not sure what’s going on with it, as I haven’t added any new e-mail alerts lately.

What should I be looking for to try and identify the problem? It’s hard to tell when it stops working because everything else looks fine. Any help to point me in the right direction would be much appreciated, thanks!

Michael.Stofan · August 8, 2012, 10:49pm

What version of Ignition are you running? How many emails are going out in a given day (ballpark)?

First thing I’d recommend checking is the console on the gateway. Post up the console export if you can so I can take a look.

Duffanator · August 8, 2012, 10:52pm

I’m running 7.5.1. I would say maybe 30 to 40 e-mails a day total (depends on the day and how the plant is running). I’ll post an excerpt when I get back to work. I’m just not sure what I should be looking for.

Michael.Stofan · August 8, 2012, 11:11pm

One more thing: grabbing a thread dump while the emails aren’t working would also be helpful.

Go to the configuration page --> console --> threads tab

Scroll all the way to the bottom and click the ‘Thread Dump’ button and post up the resulting text file. Again, make sure to do this while the problem is occurring.

Duffanator · August 9, 2012, 11:41am

Here is a thread dump from the gateway:

ThreadDump20120809.txt (104 KB)

and here is a log export:

logs.bin.gz (398 KB)

Though I don’t know how much help it will be because it’s full of errors for sparkline charts with indirect tag history (which I have another problem report in for here).

Colby.Clegg · August 9, 2012, 4:19pm

Hi,

Yes, not much there. Let’s try this: Go to Console>Levels, and search for “Alerting.Notification”. There will be a sub logger for each email profile. Turn them to “all”.

To help keep things clear, you might disable that tag history logger for now. Search for “DataLoader”, and turn the “IgnitionDB_Loader” to “fatal”.

Note: These settings will revert on a gateway restart.

With that set, you should see the following things:

Every active alert that passes the initial checks (i.e., pretty much all alerts after startup) will result in a “Profile X received alert: Y” message.
If email addresses found: “Scheduling email of message…”
If not: “Not emailing, no addresses found”.

You may also want to set the logger “Alerting.AlertBus” to “all” as well.

With all that, we should be able to narrow down where the problem might be occurring. The thread dump might still be useful, as long as you know for a fact that the email system isn’t working. In the logs, this might look like messages from alertbus, but not from the notification profile.
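For anyone sifting through a long export by hand, here is a rough sketch of how the three messages above could be tallied from a plain-text copy of the log. The message fragments are the ones quoted in this post; the layout of each line (timestamp, logger name), the sample lines, and the function names `summarize` and `diagnose` are all assumptions made for illustration, not part of Ignition.

```python
import re

# Message fragments quoted in the post above. The rest of each log line
# (timestamp, logger name, profile name) is an assumed layout.
PATTERNS = {
    "received": re.compile(r"received alert"),
    "scheduled": re.compile(r"Scheduling email of message"),
    "no_addresses": re.compile(r"Not emailing, no addresses found"),
}


def summarize(lines):
    """Count how many log lines contain each notification message."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def diagnose(counts):
    """Turn the counts into a rough hint about where to look next."""
    if counts["received"] == 0:
        return "profile saw no alerts"
    if counts["scheduled"] == 0 and counts["no_addresses"] > 0:
        return "alerts arrive but no addresses are found"
    return "emails are being scheduled"


# Hypothetical sample lines, only to show the shape of the input.
sample = [
    "15:14:09 Alerting.Notification.Main Profile Main received alert: Line Stopped",
    "15:14:09 Alerting.Notification.Main Not emailing, no addresses found",
]
print(diagnose(summarize(sample)))
```

If the first case comes back, the problem is likely upstream of the notification profile; the second case matches the "no addresses found" symptom and points at how the distribution groups are loaded.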

Regards,

Duffanator · August 9, 2012, 7:26pm

Ok, I got some logs of the alerting stuff:

logs.bin.gz (431 KB)

I don’t see any errors though. One thing I noticed is that at 3:14:09 PM an alarm came in that should have been e-mailed out, but the log says that no addresses could be found. I checked my alert groups and contacts and everything is there in the database. Here is my alert group setting that would apply to that alarm:

and the contacts that are assigned to that group are there as well. What should I look for now?

Michael.Stofan · August 10, 2012, 5:56pm

Are you running replication on your databases by chance?

Duffanator · August 10, 2012, 7:42pm

Our database administrator says that we do not run replication on that database.

Colby.Clegg · August 10, 2012, 8:52pm

And your database connection in Ignition isn’t configured for failover? I can’t see any immediate problem with your expression, but unfortunately the log messages don’t include the severity. Are those tags set for “High” severity?

The questions about failover & replication pertain to the fact that the distribution groups get reloaded often, and if the database had failed over, it’s possible that the current target might not have them for some reason. However, the UI in the gateway also uses the same data, so since they show up there, I can’t imagine it’s that (unless some sort of error is happening on the load that we’re not seeing).

So, to be clear: these distro groups work with these alarms (say “SQLTags.default.Kill Floor Main PLC/Clean Side Stops/Dress Stop Manual Split Station.Dressing Line Stopped at Manual Split Station” for example), and then after some days stop working?

Regards,

diat150 · August 10, 2012, 9:21pm

I’ve seen the same thing in 7.2, with no database failover or database replication.

Just like the OP says, the log says that there are no contacts for the alarm, but of course that isn’t true. Every once in a while the system that I was administering would have a database fault; everything would recover fine, but I never could pin down whether the two were related.

We ended up setting up an alarm that would email out every 30 minutes, and then we set up a system that would check the inbox and verify that the alarm emails made it in; if not, it would notify us. I think instead of restarting the system, all I would do is just go to the alert notification profile config on the webpage and save it, and then everything would work correctly again.
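The staleness check at the heart of a watchdog like that can be sketched roughly as below. The 30-minute interval comes from the description above; the 10-minute grace period, the function name `heartbeat_is_stale`, and the idea of passing in the timestamp of the newest heartbeat email are assumptions for illustration. Reading that timestamp from the mailbox and sending the notification are left out.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=30)  # interval described in the post
GRACE = timedelta(minutes=10)               # assumed slack for mail delivery


def heartbeat_is_stale(last_received, now,
                       interval=HEARTBEAT_INTERVAL, grace=GRACE):
    """Return True when the newest heartbeat email is overdue.

    last_received is the timestamp of the most recent heartbeat email
    found in the inbox, or None if none has ever been seen.
    """
    if last_received is None:
        return True
    return now - last_received > interval + grace


# Example with fixed timestamps: 45 minutes of silence exceeds 30 + 10.
now = datetime(2012, 8, 10, 12, 0)
print(heartbeat_is_stale(now - timedelta(minutes=45), now))
```

The grace period keeps ordinary mail delays from raising a false alarm; how much slack is reasonable depends on the mail server.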

Duffanator · August 10, 2012, 9:44pm

Yes, this is correct. It will work fine sending out e-mails and then all of a sudden after maybe a day or two it stops sending out the e-mails. All of the alarms still show up in the alarm tables and get saved into the ALERT_LOG database table but the e-mails don't go out. After a gateway restart it will work again for a few days and then stop.

This started happening after the upgrade from 7.5.0 to 7.5.1 if that helps at all. E-mails worked without a problem for me from 7.2.5 (my original install) to 7.5.0.

We do not have a failover database, however our database is in a MSSQL cluster on virtual servers. But still, there is only "one" database that Ignition connects to for alert data.

Colby.Clegg · August 10, 2012, 10:07pm

Ok, thanks for the info. Yes, this has definitely been something of a ghost problem that has popped up now and then, and I haven’t been able to find any clues that tie everything together. First thing we should probably do is to stop reloading all of the groups every minute, as there could be something going on with the expression initialization, and with the current system we’re simply rebuilding the same thing over and over again. The idea is that the tables might be edited from the outside, but in reality this is a minor use case. We could probably make it an option to have it poll.
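To make the idea concrete, here is a generic sketch of "load once, reload only on request, with polling as an option". It is not Ignition's code: the class name `GroupCache`, the `loader` callable, and the `poll_seconds` setting are all hypothetical names used only to illustrate the design.

```python
import time


class GroupCache:
    """Load groups once; reload only when invalidated or on an optional poll."""

    def __init__(self, loader, poll_seconds=None, clock=time.monotonic):
        self._loader = loader              # callable that reads the groups
        self._poll_seconds = poll_seconds  # None disables polling entirely
        self._clock = clock
        self._groups = None
        self._loaded_at = None

    def invalidate(self):
        """Force a reload on the next get(), e.g. after a config save."""
        self._groups = None

    def get(self):
        expired = (
            self._poll_seconds is not None
            and self._loaded_at is not None
            and self._clock() - self._loaded_at >= self._poll_seconds
        )
        if self._groups is None or expired:
            self._groups = self._loader()
            self._loaded_at = self._clock()
        return self._groups
```

With `poll_seconds=None` the groups are built once and kept until something calls `invalidate()`; setting `poll_seconds=60` would give a once-a-minute reload for sites that do edit the tables from the outside.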

I’ll try to also take another look at the error handling/logging in there, and try to get something put together on Monday.

Regards,

Duffanator · September 24, 2012, 4:12pm

FYI, I think I fixed this by putting the mail server DNS NAME in the address setting field instead of the server IP address. Ever since I did that (about a week ago) it’s been working fine. I don’t know if that makes sense to you? I’ll let you know if I run into any more problems with it.
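In case it helps anyone comparing the two settings, a small sketch like the one below can tell whether a configured address is an IP literal or a name, and show what a name currently resolves to. The host names are made up, port 25 is just the standard SMTP port (the real server may use another), and the helper names `is_ip_literal` and `resolve` are hypothetical. It does not show whether the name change was the actual fix.

```python
import ipaddress
import socket


def is_ip_literal(host):
    """Return True if host is an IPv4 or IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def resolve(host, port=25):
    """Return the sorted addresses the host currently resolves to.

    Raises socket.gaierror if the name cannot be resolved.
    """
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Hypothetical values for the mail server address setting.
for setting in ("192.168.1.10", "mail.example.com"):
    kind = "IP literal" if is_ip_literal(setting) else "DNS name"
    print(setting, "is a", kind)
```

Running `resolve()` against the real mail server name from the gateway machine would show whether it still maps to the IP address that was configured before.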

dave.fogle · September 25, 2012, 6:09pm

I'm not sure that this would have actually made any difference. There was a change that was implemented in 7.5.3 that was meant to address this issue. Is it possible you upgraded to 7.5.3 around the same time you changed the IP address to the DNS name in the notification settings?

Duffanator · September 25, 2012, 6:14pm

I did upgrade to 7.5.3 over the weekend but I made the DNS name change a week before that (then I was running 7.5.1). At any rate, it’s been working fine so far. I just wanted to let you know what I changed so you were aware. Thanks!