Monitoring and Other Mayhem

April 18, 2016

Yeah, it was an ugly day today. We got a few inches of rain (complete with wild thunderstorms) in just a couple of hours, and the beautiful river I live on has turned to mud as a result. A perfectly cruddy end to a perfectly cruddy week.

At least I got something accomplished this weekend, though…

The Monitor Maze

We all like to have visibility into the systems we run, and I’m no different, so this weekend saw the first whack at a monitoring solution for floating.io. Historically I’ve used Nagios for this as it tends to be very simple, but I decided it was time to move on. Not only is the Nagios interface truly ugly (unless you pay for Nagios XI, and maybe even then), but it requires a bag on the side to handle metrics gathering and reporting.

I wanted something better.

I’m tired of monitoring solutions that don’t do it all in one package. To me, capacity planning is part and parcel of monitoring, and it amazes me that the core Nagios package doesn’t have support for it out of the box. And what they do have is quite ugly. I like a comfortable interface, and nothing about Nagios is comfortable.

Well, except the check scripts. I like those.

And so I turned to Google and did some digging. There are, of course, all the various cloud-based monitoring solutions (including Amazon’s own CloudWatch), but I don’t see the need to pay money for this. Not to mention, SaaS solutions tend to be inflexible, and involve sending information outside my own network, which I dislike. Running servers in the cloud is bad enough without deliberately punching a bunch more holes in my security.

The Options

Most of the “good” monitoring solutions are, sadly, commercial. Simply put, I’m not paying money to monitor my personal web site. It’s a waste of funds. Let’s be honest here: if I just wanted basic “it’s down, doofus!” monitoring, a few cron jobs would suffice. Since I don’t actually need more than that, spending money on it isn’t the solution.
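For the record, the cron-job approach really is about that simple. Here is a minimal sketch in Python of what such a check might look like. It is not anything actually running on this site: the names `is_up` and `main`, the example.com URL, and the timeout are all made up for illustration.

```python
"""Hypothetical "it's down, doofus!" check meant to be run from cron.

Every name and default here is illustrative, not taken from a real setup.
"""
import sys
import urllib.request


def is_up(url, timeout=10.0, fetch=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with fetch(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses, so this
        # covers DNS failures, refused connections, timeouts, and 4xx/5xx.
        return False


def main(argv, check=is_up):
    """Check one URL; return a shell-style exit code (0 = up, 1 = down)."""
    url = argv[1] if len(argv) > 1 else "https://example.com/"
    if check(url):
        return 0  # stay silent on success so cron has nothing to report
    print(f"DOWN: {url}", file=sys.stderr)
    return 1
```

To use it as a script, it would end with `sys.exit(main(sys.argv))`, and a crontab entry along the lines of `*/5 * * * * /usr/bin/python3 /path/to/check_site.py https://example.com/` (path hypothetical) would run it every five minutes. Cron typically mails any output to the job's owner when mail delivery is set up, which is why the script only prints on failure.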

That leaves us with the open source stuff, which is a bit of a mixed bag.

Nagios is, of course, the go-to solution. I was originally going to go that route, but historical trend analysis is one of the reasons I want the thing, and Nagios just doesn’t do that. Yes, there’s PNP4Nagios and a number of other tools, but I’d really rather use a single, well-integrated solution with a nice UI.

I thought about collectd, and even looked into it a bit, but it seems to be just another component in a build-it-yourself solution. Not bad (especially when combined with Graphite), but again, multiple tools. Not what I’m looking for.

I also thought about the old standbys (MRTG or PRTG), but they have the opposite problem: no real alerting.

In the end, I only found one solution that really seemed to stand out. It was available packaged as an RPM (a good start) and had monitoring, alerting, and trending. It supported both agent-based and SNMP-based data collection. It was flexible, and (at first glance) had a decent UI that integrated it all.

So I decided to try out Zabbix.