Some tidbits about open source system monitoring that might someday make it into something cohesive.
|30 Sep 2011||Nagios World Conference 2011. Many of my thoughts about trending and graphing are rolled up in the presentation (HTML, PDF) I made at the conference.|
|25 Feb 2011||
The folks at OmniTI have come up with some very nice monitoring tools with Reconnoiter (the open source monitoring platform) and Circonus (the paid service). After a year or so, it is still early days, but the plumbing looks robust and they have good aesthetic and operational sense, even for early alphas. Once this gets packaged and refined a bit I could see using it, even on my little installations (5 or 6 sites, each with 40-50 hosts and thousands of services).
I just deployed another batch of guest virtual machines, this time for a cross-platform automated build/test cluster. Each is different from the others (OS type, OS version, arch), and each will be replicated numerous times. I've got an nrpe package in place so installing/configuring/updating the clients is easy, but I still have not figured out a nice way to make Nagios detect the new instances automagically, and stop fussing about the instances when they go away. Some instances will last quite a while (years), others will go on- and off-line regularly (daily/weekly/monthly schedules), others will run for a while then go away permanently. It would be nice to be able to easily tell Nagios about these changes. Nagios' 'scheduled downtime' might work for managing this. For now I enable/disable blocks of templates in the Nagios configs. It works, but I sometimes make mistakes, and I'd like to be able to hand it off to someone else more easily.
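One way to take the hand-editing out of this is to generate the host definitions from a small inventory. The following is only a sketch under assumptions: the `Instance` fields, the host names, the addresses, and the `build-vm` template name are all hypothetical, and it assumes each instance inherits everything else from a host template through the standard `use` directive.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str       # hypothetical VM name
    address: str    # IP address or DNS name
    template: str   # Nagios host template inherited via "use" (assumed to exist)
    active: bool    # False while the instance is offline or retired


def render_hosts(instances):
    """Return Nagios host definitions for the active instances only."""
    blocks = []
    for inst in sorted(instances, key=lambda i: i.name):
        if not inst.active:
            continue  # inactive instances simply drop out of the config
        blocks.append(
            "define host {\n"
            f"    use        {inst.template}\n"
            f"    host_name  {inst.name}\n"
            f"    address    {inst.address}\n"
            "}\n"
        )
    return "\n".join(blocks)


# Hypothetical inventory: one running build VM and one that is switched off.
inventory = [
    Instance("build-deb6-x64", "10.0.0.21", "build-vm", True),
    Instance("build-f14-x86", "10.0.0.22", "build-vm", False),
]
config_text = render_hosts(inventory)
print(config_text)
```

The output could be written to a file in a directory that nagios.cfg picks up with `cfg_dir`, followed by a config check and a reload. Flipping `active` in the inventory then replaces commenting blocks of templates in and out by hand.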
I have not yet run into monitoring resource limits. A few thousand checks are not enough to stress even the ancient hardware on which my Nagios installations reside.
|06 Feb 2011||
There is a fundamental conflict between the unix philosophy of "each tool should do one job well" and the need for seamlessly integrated systems. I suppose that doing the former well makes doing the latter easier.
Graphing of trend data in Nagios (or Icinga) can be problematic because the collection of data is tied to sensing state (OK, WARN, CRIT). Sometimes I just want to collect data, and other (most?) times I want to have triggers associated with the data I collect. Cacti and friends do the former, Nagios and friends do the latter.
For example, I would like to monitor the temperature, battery charge, time remaining, load, and line voltage on a UPS that is installed on a solar-powered system. I could do all of the sampling in a single call using my check_apcupsd plugin - all of the data come back in the perfdata string. I can even write check_apcupsd so that I specify all of the WARN and CRIT levels (which might change over time) in a single call. However, if I do this then I lose the ability to get individual notifications from Nagios for each of the parameters I am monitoring.
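To make the trade-off concrete, here is a rough sketch of what splitting a single perfdata string back into per-parameter states could look like. It assumes the usual `label=value[UOM];warn;crit` perfdata layout and only a subset of the plugin threshold range syntax (`N`, `N:`, `~:N`, `N:M`, no `@` inversion). The metric names and numbers are invented for illustration and are not the actual output of check_apcupsd.

```python
import re

OK, WARN, CRIT = 0, 1, 2


def outside(value, rng):
    """True if value falls outside a threshold range (subset of the range syntax)."""
    if not rng:
        return False                      # no threshold given for this level
    if ":" in rng:
        lo_s, hi_s = rng.split(":", 1)
    else:
        lo_s, hi_s = "0", rng             # a bare "N" means the range 0..N
    lo = float("-inf") if lo_s == "~" else float(lo_s or 0)
    hi = float(hi_s) if hi_s else float("inf")
    return value < lo or value > hi


def parse_perfdata(perfdata):
    """Yield (label, value, warn_range, crit_range); assumes labels without spaces."""
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        fields = rest.split(";")
        match = re.match(r"-?[\d.]+", fields[0])   # strip a trailing unit such as %
        if match is None:
            continue                                # skip anything malformed
        warn = fields[1] if len(fields) > 1 else ""
        crit = fields[2] if len(fields) > 2 else ""
        yield label, float(match.group(0)), warn, crit


def metric_states(perfdata):
    """Map each metric label to its own OK/WARN/CRIT state."""
    states = {}
    for label, value, warn, crit in parse_perfdata(perfdata):
        if outside(value, crit):
            states[label] = CRIT
        elif outside(value, warn):
            states[label] = WARN
        else:
            states[label] = OK
    return states


# Hypothetical UPS sample: the battery charge is below its critical floor.
sample = "temp=31.5;35;40 charge=18%;50:;20: timeleft=42;10:;5: load=23%;70;90 linev=119;105:130;100:135"
states = metric_states(sample)
print(states)
```

With one state per label, a single sampling call still yields something that can be reported per parameter, which is the part that gets lost when the plugin returns only one overall state.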
So here are some options for doing it in Nagios:
I face the same issue with sensors (multiple CPU and motherboard temperatures) and disk arrays. I suppose I can use a combination of the options above; it's just not an out-of-the-box solution with Nagios (or any other system?).
I could do all of the data collection in Cacti and all of the notifications in Nagios. But then I have two separate systems pounding on each system I am monitoring. I could have one system collect all the data, then have two separate systems that use the data, one to do graphing and another to provide notifications. What's a boy to do?
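The "collect once, use twice" arrangement can be sketched as a collector that samples each host a single time and hands the result to every consumer. Everything here is hypothetical (the `Collector` class, the fake sampler, the 20% charge threshold); it is meant only to show the shape of the idea, not how any existing tool does it.

```python
class Collector:
    """Samples a host once and passes the result to every subscriber."""

    def __init__(self, sample_fn):
        self.sample_fn = sample_fn
        self.subscribers = []

    def subscribe(self, fn):
        self.subscribers.append(fn)

    def poll(self, host):
        data = self.sample_fn(host)   # the only call that touches the monitored host
        for fn in self.subscribers:
            fn(host, data)
        return data


calls = []     # records how often the monitored host was actually sampled
history = []   # stand-in for a trend store such as a set of RRD files
alerts = []    # stand-in for a notification queue


def fake_sample(host):
    """Hypothetical sampler returning fixed UPS readings."""
    calls.append(host)
    return {"temp": 31.5, "charge": 18.0}


def store(host, data):
    """Graphing side: keep every sample, whatever its state."""
    history.append((host, dict(data)))


def notify(host, data):
    """Notification side: react only when a threshold is crossed (assumed 20%)."""
    if data["charge"] < 20.0:
        alerts.append((host, "charge", data["charge"]))


collector = Collector(fake_sample)
collector.subscribe(store)
collector.subscribe(notify)
collector.poll("ups1")
print(len(calls), history, alerts)
```

The monitored host is sampled once per poll no matter how many consumers are attached, so graphing and notification can evolve separately without doubling the load.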
|05 Feb 2011||
How did I miss OpenNMS? OpenNMS is an awesome system. Unfortunately the UI, while fairly polished, is much too complicated. It would take me months to get used to the system before I would automatically filter out all of the extra cruft that is reported in each panel. And the graphing is nifty for one-offs, but not so good for gestalt.
I watched the zenoss demos again. Someone at zenoss should take all of the videos down and put up new ones that illustrate some ease-of-use rather than illustrating poor design. How many steps does it take to create a graph? "The graph name doesn't^H^Ht matter"? And why go through the pain of creating a GUI when the GUI is far more complicated to use and maintain than plain text files?
Apparently n2rrd is still alive. That one is even simpler than nagiosgraph, but it requires a great deal of out-of-the-box configuration.
I really want these systems to work, but they're not there yet.
|14 Dec 2010||
I now manage 2 separate Nagios installations, each with about 40 hosts and 500 services. Thanks to nagiosgraph testing I have installed Nagios on redhat, fedora, centos, debian, and ubuntu systems, both from source and from packages. I have installed NRPE on various linux flavors and MacOSX, plus NSClient++ on winXP and win7.
At this point, data collection and graphing pretty much go together. I'm not sure whether that is necessarily good or bad, just the way it is.
I've seen people ripping on RRD as the data storage, some saying that the data should live in a 'real' database. I think RRD is fantastic for this application, and very easy to manage. It's easier to delete a file than connect to a database, query for the location of the data set, remember the sql to delete it, then do the delete, then pray that you got the right table/row/whatever.
For graphing there are basically three options:
As far as I can tell, nagiosgrapher used to be open source, but is now a closed source commercial product. The demos I have seen look pretty nice for single graphs (zooming, panning, etc). Configuration looks pretty complicated. I'm not terribly keen about how they aggregate graphs or browse host/service combinations or configure sets of graphs for regular reporting/viewing/monitoring.
nagiosgraph is by far the smallest graphing package. It is compact and has minimal dependencies - great for some old hardware and for minimal installs. Customization of graphs in nagiosgraph is parameterized whereas pnp4nagios uses a template-based approach. nagiosgraph is the first (only?) RRD-based nagios grapher with in-place zooming and data/time feedback when mousing over each graph. It makes for easy browsing and isolation of data.
After evaluating the graphing alternatives in 2009, I decided to work on nagiosgraph because I think there needs to be some (solid) competition between collection/graphing alternatives, and I wanted graph zooming that happens in-place, not in yet another clunky window :)
|16 Mar 2010||
Nagios is ugly but functional. Great for tinkering, not so great for just-let-me-monitor-my-standard-systems. With some minor adjustments it could be fantastic for the small business, home networking segment. debian (and other?) packaging helps with the out-of-box experience, at least on the client end of things.
What is needed? Apply action to multiple hosts. Apply action to multiple services. Paging of long host/service lists. Customizable dashboards. Built-in graphing.
icinga - all the 'we are a happy community' cheerleading is annoying. open dev process will suck in a cadre of dedicated users. prolly will find its own in a year or two. that will happen when it adds features that are not in nagios, and when it breaks nagios compatibility.
Munin is all about graphing. Lousy for notifications. Lousy presentation.
Zenoss packaging shows promise, but too many clicks involved. Still suffers from ajax cruft-itis, and even though you configure it via a web browser, it is just as much pain as doing it through a config file. Too many clicks and all kinds of hidden constraints/magic. Graph names don't matter either?
nagiosgraph still for small installations and tinkerers, really needs templates. pnp4nagios is getting better, but still too much 'see the nifty ajax' rather than 'these are the patterns you need to track your systems'. and too many dependencies. old nagiosgrapher is about like nagiosgraph. new nagiosgrapher offers more graphing features, but still requires just as much tweaking as any other. all of the graphing options use the same mechanism to do their thing, so it all boils down to (1) how good is the presentation, (2) how well are they configured out-of-the-box, and (3) how easy is it to customize and extend the graphs once it is running.
Here are a few things they all miss:
Most of the ajaxy solutions show their youth. It may be possible to create three instances of a query, but why? Eventually someone will create a few new features that go beyond Nagios, everyone will incorporate those, the UIs will settle into usable patterns, and the cruft will go away (remember nagios 3D network maps?). Right now (March 2010) everyone is aping Nagios for a user interface until they get their ajax plumbing to work reliably enough to spend time thinking about how people should actually use the interfaces.
The ultimate package should support both admin mechanisms - web based or command-line/text-config-files. Sometimes the browser is the right tool, sometimes the command-line is the right tool.
Systems should do standard reports/listings (e.g. most of the basic stuff in Nagios, plus a few other views). They should also provide a dashboard area in which one can combine subsets of the standard listings. Some dashboards have a lifespan of a few minutes, some have a lifespan of a few days (e.g. while tracking down a transient problem). Some live forever (e.g. query for specific network indicators or critical hosts/services).
It is good that there are multiple options. And I'm not even considering derivative works (OpsView, Netways, etc).
|16 Jan 2010||
My initial impressions of open source monitoring:
Copyright © 2010-2011 Matthew Wall, all rights reserved