Ramblings of a Data Junkie

Some tidbits about open source system monitoring that might someday make it into something cohesive.

30 Sep 2011	Nagios World Conference 2011. Many of my thoughts about trending and graphing are rolled up in the presentation (HTML, PDF) I made at the conference.

25 Feb 2011	The folks at OmniTI have come up with some very nice monitoring tools with Reconnoiter (the open source monitoring platform) and Circonus (the paid service). After a year or so, it is still early days, but the plumbing looks robust and they have good aesthetic and operational sense, even for early alphas. Once this gets packaged and refined a bit I could see using this, even on my little installations (5 or 6 sites, each with 40-50 hosts and thousands of services).

I just deployed another batch of guest virtual machines, this time for a cross-platform automated build/test cluster. Each is different from the others (OS type, OS version, arch), and each will be replicated numerous times. I've got an nrpe package in place so installing/configuring/updating the clients is easy, but I still have not figured out a nice way to make Nagios detect the new instances automagically, and stop fussing about the instances when they go away. Some instances will last quite a while (years), others will come online and go offline regularly (daily/weekly/monthly schedules), and others will run a while then go away permanently. It would be nice to be able to easily tell Nagios about these changes. Nagios' 'scheduled downtime' might work for managing this. For now I enable/disable blocks of templates in the Nagios configs. It works, but I sometimes make mistakes, and I'd like to be able to hand it off to someone else more easily.

I have not yet run into monitoring resource limits. A few thousand checks are not enough to stress even the ancient hardware on which my Nagios installations reside.
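One way to take the hand-editing out of enabling and disabling blocks of host definitions is to generate them from a small inventory. What follows is only a sketch of that idea, not something I have deployed. The `Instance` record, the `render_hosts` function, and the `build-vm` template name are all made up for illustration. Only the `define host` block and its `use`, `host_name`, and `address` directives are standard Nagios object syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One guest VM in a hypothetical inventory (not a Nagios concept)."""
    name: str
    address: str
    active: bool = True  # False means the instance is offline or retired


def render_hosts(instances, template="build-vm"):
    """Return Nagios host definitions for the active instances only.

    'build-vm' is an assumed host template that would have to be defined
    elsewhere in the Nagios configuration.
    """
    blocks = []
    for inst in sorted(instances, key=lambda i: i.name):
        if not inst.active:
            continue  # inactive instances simply drop out of the config
        blocks.append(
            "define host {\n"
            f"    use        {template}\n"
            f"    host_name  {inst.name}\n"
            f"    address    {inst.address}\n"
            "}\n"
        )
    return "\n".join(blocks)
```

The output would be written to a file that the main Nagios config includes, followed by a config check and a reload. Flipping `active` in the inventory then replaces commenting blocks in and out by hand, which is the part I would most like to hand off to someone else.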

06 Feb 2011	There is a fundamental conflict between the Unix philosophy that "each tool should do one job well" and the need for seamlessly integrated systems. I suppose that doing the former well makes doing the latter easier.

Graphing of trend data in Nagios (or Icinga) can be problematic because the collection of data is tied to sensing state (OK, WARN, CRIT). Sometimes I just want to collect data, and other (most?) times I want to have triggers associated with the data I collect. Cacti and friends do the former, Nagios and friends do the latter.

For example, I would like to monitor the temperature, battery charge, time remaining, load, and line voltage on a UPS that is installed on a solar-powered system. I could do all of the sampling in a single call using my check_apcupsd plugin - all of the data come back in the perfdata string. I can even write check_apcupsd so that I specify all of the WARN and CRIT levels (which might change over time) in a single call. However, if I do this then I lose the ability to get individual notifications from Nagios for each of the parameters I am monitoring.

So here are some options for doing it in Nagios:

1. single invocation of check_apcupsd to gather perfdata with warn/crit, return OK/WARN/CRIT/UNKNOWN based on aggregation of all values
2. single invocation of check_apcupsd to gather perfdata with warn/crit, but always return OK status
3. one invocation of check_apcupsd for each parameter for which I want notification, each with warn/crit specified, each returning OK/WARN/CRIT/UNKNOWN as appropriate
4. do both 2 and 3, and enable graphing for 2 but not for 3

I face the same issue with sensors (multiple CPU and motherboard temperatures) and disk arrays. I suppose I can use a combination of the options above; it's just not an out-of-the-box solution with Nagios (or any other system?).

I could do all of the data collection in Cacti and all of the notifications in Nagios. But then I have two separate systems pounding on each system I am monitoring. I could have one system collect all the data, then have two separate systems that use the data, one to do graphing and another to provide notifications. What's a boy to do?
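The difference between the aggregated check and the always-OK check is small enough to sketch. This is not the real check_apcupsd. The `Limit` class, the `evaluate` function, and the sample labels are invented for illustration, and the readings are passed in rather than fetched from apcupsd. The parts that follow the standard Nagios plugin conventions are the exit codes (0/1/2/3), the `text | perfdata` output line, and the `N:` range form for thresholds where a low value is the problem.

```python
from dataclasses import dataclass

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


@dataclass(frozen=True)
class Limit:
    """Hypothetical warn/crit pair for one UPS parameter."""
    warn: float
    crit: float
    low_is_bad: bool = False  # True for things like battery charge

    def state(self, value):
        if self.low_is_bad:
            if value < self.crit:
                return CRITICAL
            if value < self.warn:
                return WARNING
        else:
            if value > self.crit:
                return CRITICAL
            if value > self.warn:
                return WARNING
        return OK

    def perf(self, label, value):
        # "N:" is the plugin range form meaning "alert when below N".
        suffix = ":" if self.low_is_bad else ""
        return f"{label}={value};{self.warn}{suffix};{self.crit}{suffix}"


def evaluate(readings, limits, always_ok=False):
    """Return (exit_code, output_line) for one sampling of the UPS.

    always_ok=False reports the worst state of any parameter.
    always_ok=True reports OK regardless, so the service exists only to
    feed perfdata to the grapher.
    """
    if not readings or any(label not in limits for label in readings):
        return UNKNOWN, "UPS UNKNOWN - missing readings or limits"
    states = {label: limits[label].state(value) for label, value in readings.items()}
    code = OK if always_ok else max(states.values())
    problems = [f"{label} {STATE_NAMES[s]}" for label, s in states.items() if s != OK]
    summary = ", ".join(problems) if problems else "all values within limits"
    perfdata = " ".join(limits[label].perf(label, value) for label, value in readings.items())
    return code, f"UPS {STATE_NAMES[code]} - {summary} | {perfdata}"
```

A real plugin would print the returned line and exit with the returned code. Per-parameter notification would be the same function called with a single-entry `readings` and `limits`, one Nagios service per parameter. That is exactly where the extra invocations come from.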

05 Feb 2011	How did I miss OpenNMS? OpenNMS is an awesome system. Unfortunately the UI, while fairly polished, is much too complicated. It would take me months to get used to the system before I would automatically filter out all of the extra cruft that is reported in each panel. And the graphing is nifty for one-offs, but not so good for gestalt.

I watched the Zenoss demos again. Someone at Zenoss should take all of the videos down and put up new ones that illustrate some ease-of-use rather than illustrating poor design. How many steps does it take to create a graph? "The graph name doesn't^H^Ht matter"? And why go through the pain of creating a GUI when the GUI is far more complicated to use and maintain than plain text files?

Apparently n2rrd is still alive. That one is even simpler than nagiosgraph, but it requires a great deal of out-of-the-box configuration.

I really want these systems to work, but they're not there yet.

14 Dec 2010	I now manage 2 separate Nagios installations, each with about 40 hosts and 500 services. Thanks to nagiosgraph testing I have installed Nagios on redhat, fedora, centos, debian, and ubuntu systems, both from source and from packages. I have installed NRPE on various linux flavors and MacOSX, plus NSClient++ on winXP and win7.

I'm still watching icinga. I figure they need another year to really stand on their own. Someone should de-uglify the icinga web interface. There are some nice parts in the new web interface (e.g. the interactivity of the map, some of the data views), but in general it is still way too much ui and javascript cruft and not enough focus-on-my-monitoring-data.

At this point, data collection and graphing pretty much go together. I'm not sure whether that is necessarily good or bad, just the way it is. I've seen people ripping on RRD as the data storage, some saying that the data should live in a 'real' database. I think RRD is fantastic for this application, and very easy to manage. It's easier to delete a file than connect to a database, query for the location of the data set, remember the sql to delete it, then do the delete, then pray that you got the right table/row/whatever.

For graphing there are basically three options:

- nagiosgraph
- pnp4nagios
- nagiosgrapher

As far as I can tell, nagiosgrapher used to be open source, but is now a closed source commercial product. The demos I have seen look pretty nice for single graphs (zooming, panning, etc). Configuration looks pretty complicated. I'm not terribly keen about how they aggregate graphs or browse host/service combinations or configure sets of graphs for regular reporting/viewing/monitoring.

pnp4nagios seems to have a lot of backing and looks pretty robust (as of 6 - the 4 stuff was a bit wobbly). But it is huge, and I do not care for all the extra javascript cruft, or all the extra libraries on which it depends. If you are doing huge sites or distributed nagios processing, pnp4nagios provides easier installation of the bits needed to save to databases and such (you can do this with nagiosgraph and nagiosgrapher, but you'll have to write a bit of the plumbing yourself with those instead of just following the pnp4nagios cookbook).

nagiosgraph is by far the smallest graphing package. It is compact and has minimal dependencies - great for some old hardware and for minimal installs. Customization of graphs in nagiosgraph is parameterized whereas pnp4nagios uses a template-based approach. nagiosgraph is the first (only?) RRD-based nagios grapher with in-place zooming and data/time feedback when mousing over each graph. It makes for easy browsing and isolation of data.

After evaluating the graphing alternatives in 2009, I decided to work on nagiosgraph because I think there needs to be some (solid) competition between collection/graphing alternatives, and I wanted graph zooming that happens in-place, not in yet another clunky window :)

16 Mar 2010	Nagios is ugly but functional. Great for tinkering, not so great for just-let-me-monitor-my-standard-systems. With some minor adjustments it could be fantastic for the small business, home networking segment. debian (and other?) packaging helps with the out-of-box experience, at least on the client end of things. What is needed?

- Apply action to multiple hosts.
- Apply action to multiple services.
- Paging of long host/service lists.
- Customizable dashboards.
- Built-in graphing.

icinga - all the 'we are a happy community' cheerleading is annoying. The open dev process will suck in a cadre of dedicated users. It prolly will find its own in a year or two. That will happen when it adds features that are not in nagios, and when it breaks nagios compatibility.

Munin is all about graphing. Lousy for notifications. Lousy presentation.

Zenoss packaging shows promise, but too many clicks involved. Still suffers from ajax cruft-itis, and even though you configure it via a web browser, it is just as much pain as doing it through a config file. Too many clicks and all kinds of hidden constraints/magic. Graph names don't matter either?

nagiosgraph is still for small installations and tinkerers, and really needs templates.

pnp4nagios is getting better, but still too much 'see the nifty ajax' rather than 'these are the patterns you need to track your systems', and too many dependencies.

The old nagiosgrapher is about like nagiosgraph. The new nagiosgrapher offers more graphing features, but still requires just as much tweaking as any other.

All of the graphing options use the same mechanism to do their thing, so it all boils down to (1) how good is the presentation, (2) how well are they configured out-of-the-box, and (3) how easy is it to customize and extend the graphs once it is running.

Here are a few things they all miss:

- concise display of information. my devices and services are more important than your gradients and titlebars.
- easy way to deploy/update configurations to target hosts, regardless of platform
- graphing - how to display graphs so one can easily go from forest to trees and back
- graphing - easy out of box, easy to customize, easy to change once deployed
- mechanism to create custom dashboard, but not at the expense of standard ways of slicing and reporting the data

Most of the ajaxy solutions show their youth. It may be possible to create three instances of a query, but why? Eventually someone will create a few new features that go beyond Nagios, everyone will incorporate those, the UIs will settle into usable patterns, and the cruft will go away (remember nagios 3D network maps?). Right now (March 2010) everyone is aping Nagios for a user interface until they get their ajax plumbing to work reliably enough to spend time thinking about how people should actually use the interfaces.

The ultimate package should support both admin mechanisms - web based or command-line/text-config-files. Sometimes the browser is the right tool, sometimes the command-line is the right tool.

Systems should do standard reports/listings (e.g. most of the basic stuff in Nagios, plus a few other views). They should also provide a dashboard area in which one can combine subsets of the standard listings. Some dashboards have a lifespan of a few minutes, some have a lifespan of a few days (e.g. while tracking down a transient problem). Some live forever (e.g. query for specific network indicators or critical hosts/services).

It is good that there are multiple options. And I'm not even considering derivative works (OpsView, Netways, etc).

Disclaimer: I have installed (from tarball) and configured (manually) nagios for 30+ hosts and 300+ services. I have contributed a boatload of Perl and JavaScript to nagiosgraph. I have written CSS and created PNG/GIF images for the Exfoliation skin for Nagios.

16 Jan 2010	My initial impressions of open source monitoring:

- Nagios is ugly out-of-the-box, but it works reliably and it is easy enough to clean up. It is really nice to be able to run the nagios monitors from the command line to test them.
- Configuring nagiosgraph is a major pain, but once it is running, no worries. Alan Brenner is working on it.
- PNP looks pretty good, but I did not want to install all of its dependencies on the web server on which I am running nagios/nagiosgraph.
- I used Cacti for years, but I found it unreliable for monitoring and I found the user interface too busy. It has gotten better, but it still has too much extra cruft in the user interface.
- There are a few other AJAXy projects out there now that try to replace nagios. They look nice, but they put too much junk between me and the machine.
- Smokeping is awesome, but too specialized. Perhaps I'll put some smokeping-iness into nagiosgraph for the ping plots...

Automatic discovery would be really nice. In lieu of that, when configuring integrated systems such as network monitoring tools, give me text files rather than mouse clicks. Of course the opposite is usually true when trending, monitoring, and diagnosing.