Observium alerts not working
I’ve been running an instance of Observium for a couple of years now. I set it up for the VPS that hosts this blog and for my home NAS. It’s really nice to be able to see how CPU, memory, storage, temperature and other indicators evolve over time. I also set up monitoring agents for Apache and Postfix to see things like how large the mail queue is or how many HTTP requests get served on average.
This had been going well with the Community Edition for a long time. But now I wanted the advanced features, like being able to send alerts when disks are getting full or when CPU utilization spikes all of a sudden. The paid edition of Observium allows you to send alert messages to various destinations: not only email, but also Pushover (instant messages to your phone) or other popular platforms like Slack or Telegram, if you’re using those. Part of my motivation to purchase the Professional edition was also that I wanted to support the authors, because I like this software and would like to see it continue in the future.
However, to my dismay, after I migrated to the paid edition and set up some alert checkers to see how it worked, it didn’t work at all. Or, to be more precise, the alerts in Observium worked just fine and showed up on the alerts page, but I never received any notification, no matter what channel I used. I was at a loss as to why, because I didn’t have a good idea of how Observium was designed. At first I considered filing a support ticket with the Observium team, but then I thought I should explore it a bit by myself first: if it worked for other people, chances were it was some typo or misconfiguration on my end.
I checked my configuration and made sure that the database schema was the right one (in the Jira tickets I could find, a wrong database schema was one of the common reasons for alert notifications not working). But still no luck, and no clues in the logs either.
Then I discovered there was a test script for alerts, test_alert.php, and when I ran it like this:
test_alert.php -d -a 42
the alert message actually was sent out, and it worked for all notification transports that I had set up.
So it was obvious that the issue was not a wrong SMTP or webhook notification setting, but something else. I suspected some discrepancy between the FPM and CLI versions of PHP, but that didn’t get me anywhere. At least it finally got me on the right track.
I started to look at how the Observium jobs are run. I vaguely remembered that whenever I updated the Community Edition, once a year or so, I first needed to stop the cron jobs. This was the main poller job that gathers statistics for Observium, running every five minutes:
# cat /etc/cron.d/observium
*/5 * * * * user /opt/observium/poller.php -h all >> /dev/null 2>&1
I was curious whether the job was running into any issues, so I stopped it in cron and ran the very same job myself. It finished all right, with no errors reported. But it took quite a long time: about 5 minutes and 30 seconds before all the polling was done. That was a red flag.
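The comparison between the job's run time and the cron interval can be sketched as a small shell helper. This is only an illustration: time_job and fits_interval are hypothetical names, not part of Observium, and the 300 seconds simply mirror the */5 cron schedule above.

```shell
#!/bin/sh
# Cron interval from the crontab entry above: every five minutes.
INTERVAL=300

# time_job COMMAND [ARGS...]
# Runs the command, discards its output, and prints the elapsed whole seconds.
# (date +%s is not strictly POSIX, but GNU and BSD date both support it.)
time_job() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# fits_interval ELAPSED [INTERVAL]
# Succeeds only if a run of ELAPSED seconds ends before the next one starts.
fits_interval() {
    [ "$1" -lt "${2:-$INTERVAL}" ]
}

# Possible usage (path taken from the crontab entry above):
#   elapsed=$(time_job /opt/observium/poller.php -h all)
#   fits_interval "$elapsed" || echo "poller needs ${elapsed}s, interval is ${INTERVAL}s"
```

With a run time of about 330 seconds against a 300 second interval, fits_interval fails, which is exactly the red flag described above.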
When I let cron run the job every five minutes again, I could see in the process list that the jobs always overlapped by about 30 seconds: a new run was kicked off every five minutes while the previous one had not completed yet.
Just for context, I’m not monitoring that many devices. But the hardware I’m running this on is pretty old by now. There’s an old Atom processor from 2010 in this machine; I guess today’s phones have more powerful CPUs. But I really like this machine. It has been really useful for various interesting stuff (like Observium) and is also very undemanding. According to the filesystem, I installed some version of Debian on it 11 years ago:
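As an aside, a common generic way to keep a slow cron job from overlapping with its next run is to put a lock around the command. The sketch below is not part of Observium and is not the fix described in this post; run_exclusive and the lock path in the usage comment are hypothetical names. It relies only on mkdir being atomic.

```shell
#!/bin/sh
# run_exclusive LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created. mkdir is atomic, so two
# concurrent callers cannot both acquire the lock.
# Returns 75 if the lock is already held, otherwise COMMAND's exit status.
run_exclusive() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        return 75
    fi
    "$@"
    status=$?
    # Release the lock. If the process is killed before this point, the
    # directory stays behind and has to be removed by hand.
    rmdir "$lockdir"
    return "$status"
}

# Possible usage (hypothetical lock path):
#   run_exclusive /tmp/observium-poller.lock /opt/observium/poller.php -h all
```

Note that with a job that needs longer than the interval, a guard like this means every other run gets skipped, so it hides the overlap rather than making the polling any faster.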
# tune2fs -l /dev/mapper/machine-root | grep created
Filesystem created: Tue Jun 28 23:27:59 2011
and it’s been running quietly in the corner ever since. I just bump the major version of the OS every couple of years. Once I had to replace a failing disk and once a faulty power supply, but that was it.
Anyway, the point is that the Observium jobs were running for so long that they overlapped each other. I wanted to see whether the problem with alerts not being sent stemmed from this. Luckily for me, there is now a better way to run Observium jobs, right in the Observium folder: poller-wrapper.py. It’s a Python script that schedules Observium jobs using a multi-threaded approach rather than running them sequentially. I noticed in the Observium documentation that it is now the recommended way of running these jobs. I guess that when I installed Observium in something like 2014, I set up cron the way the documentation recommended back then, and it just stayed that way ever since because it was not interfering with anything else, until now.
I first tried running this Python wrapper manually and it completed all the poller jobs in about a minute. So I updated the cron entry to use this wrapper instead:
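To illustrate why polling devices concurrently shortens the total run, here is a toy sketch in shell. It is not how poller-wrapper.py is implemented; poll_device is a hypothetical stand-in that just sleeps for one second instead of querying a device.

```shell
#!/bin/sh
# Hypothetical stand-in for polling one device: takes about one second.
poll_device() {
    sleep 1
}

# Poll the devices one after another: total time grows with the device count.
poll_sequential() {
    for device in "$@"; do
        poll_device "$device"
    done
}

# Start all polls in the background, then wait for every one of them:
# total time is roughly that of the slowest single poll.
poll_parallel() {
    for device in "$@"; do
        poll_device "$device" &
    done
    wait
}

# Possible usage (hypothetical device names):
#   poll_sequential nas vps router
#   poll_parallel nas vps router
```

On a slow CPU the real gain depends on how much of the polling time is spent waiting on the network rather than computing, so the toy numbers here should not be read as a prediction.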
*/5 * * * * user /opt/observium/observium-wrapper poller >> /dev/null 2>&1
(observium-wrapper is a link to the Python script) and voilà, the alert notifications suddenly started to be sent out!
I’m not really sure why the overlap was causing this. I didn’t delve into it; I was satisfied with the fact that it worked.