Monday, December 18, 2006

Hobbit Monitor :: how to report multiple temperature probe results

For Dell PowerEdge servers, OMSA can report temperatures for multiple probes, notably "BMC Ambient", "BMC Planar", "BMC Riser", and one for each physical processor. I initially wrote my own extension script named 'temps', which reports all probes in a single 'status server1.temps green' call. It worked on a test server after I did the following on the Hobbit server:
  • defined an RRD graph section in hobbitgraph.cfg covering all probes available on that server
  • set NCV_temps="*:GAUGE" in hobbitserver.cfg
  • appended temps=ncv to the TEST2RRD line in hobbitserver.cfg
  • restarted the hobbitd server
When I deployed the same extension code to a different server, the limitation of this approach became painfully obvious: the graph definition broke whenever a server reported more or fewer probes. Since a graph section in hobbitgraph.cfg is keyed to the test name, I can't add a custom graph section per server configuration, not to mention that doing so wouldn't scale! As for alerting, I tried setting up thresholds on the server, but no alert was generated even when a threshold was obviously exceeded. I wonder whether the 'status blah green' reported by the extension script somehow blocked such centralized checking.
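A minimal sketch of what the original 'temps' report boils down to may make the scaling problem concrete. It assumes the NCV handler picks up one 'name : value' line per probe, and that each probe name ends up as its own RRD dataset; the function name ncv_status and the probe readings are hypothetical, made up for illustration.

```python
# Hypothetical sketch of the original 'temps' extension: every OMSA probe is
# reported as one "name : value" line under a single status message.
# The "name : value" layout is an assumption about what NCV parsing accepts.


def ncv_status(host: str, readings: dict[str, int], color: str = "green") -> str:
    """Build one 'status host.temps color' message with one NCV line per probe."""
    lines = [f"status {host}.temps {color}"]
    for name, celsius in readings.items():
        lines.append(f"{name} : {celsius}")
    return "\n".join(lines)


# Two hypothetical servers with different probe sets (values are made up).
server1 = {"BMCAmbient": 17, "BMCPlanar": 30, "CPU1": 40}
server2 = {"BMCAmbient": 19, "BMCPlanar": 31, "BMCRiser": 28, "CPU1": 42, "CPU2": 44}

if __name__ == "__main__":
    print(ncv_status("server1", server1))
    # Probe names present on only one of the two hosts: these are the
    # datasets a single graph section for the 'temps' test can't cover.
    print(sorted(set(server1) ^ set(server2)))
```

Under that assumption, the second host brings datasets (BMCRiser, CPU2) that a graph section written for the first host knows nothing about, and every new hardware layout would need yet another definition.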

After posting the above questions to the Hobbit mailing list and getting no answers, I took it upon myself to find them. Good thing Hobbit Monitor is licensed under the GPL: I found the answers in the source code, rrd/do_temperature.c. [[ All hail Open Source & Henrik! ]]
  • The 'temperature' test is built in. Duh...
  • If 'temperature' is used as the test name on the client, nothing needs to be changed on the server end.
  • A single lump-all status command will do. The format of the data portion is somewhat restrictive: each probe needs to be in the format '&green BMCambient 17 62'.
  • The test can be done locally on the client, and the overall $status is reported by 'status server1.temperature $status'.
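The findings above suggest a client-side check that does its own thresholding and reports through the built-in 'temperature' test. Below is a minimal Python sketch under a few assumptions: the probe lines mimic the '&green BMCambient 17 62' example, the two numbers are read as Celsius and Fahrenheit (17 °C is about 62 °F), and the helper names and the 40/50 °C warn/panic thresholds are hypothetical.

```python
# Sketch of a client-side check reporting via the built-in 'temperature' test.
# Line layout follows the '&green BMCambient 17 62' example; reading the two
# numbers as Celsius and Fahrenheit is an assumption, as are the thresholds.

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def probe_color(celsius: int, warn: int, panic: int) -> str:
    """Classify one reading against hypothetical local thresholds."""
    if celsius >= panic:
        return "red"
    if celsius >= warn:
        return "yellow"
    return "green"


def temperature_status(host: str, readings: dict[str, int],
                       warn: int = 40, panic: int = 50) -> str:
    """Build 'status host.temperature COLOR' plus one '&color name C F' line per probe."""
    body = []
    overall = "green"
    for name, celsius in readings.items():
        color = probe_color(celsius, warn, panic)
        # Integer arithmetic only, since the parser doesn't cope with decimals.
        fahrenheit = celsius * 9 // 5 + 32
        body.append(f"&{color} {name} {celsius} {fahrenheit}")
        # The overall status is the worst per-probe color.
        if SEVERITY[color] > SEVERITY[overall]:
            overall = color
    return "\n".join([f"status {host}.temperature {overall}"] + body)


if __name__ == "__main__":
    print(temperature_status("server1", {"BMCambient": 17, "CPU1": 45}))
```

For example, temperature_status("server1", {"BMCambient": 17, "CPU1": 45}) yields 'status server1.temperature yellow' followed by '&green BMCambient 17 62' and '&yellow CPU1 45 113'. Sending the text is left out; in a shell extension it would typically go through $BB $BBDISP.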
Wishes:
  • It'd be nice to have this data report format documented somewhere other than the source code.
  • The current code (4.2RC1-20060712) doesn't handle non-integer input well. Such input confused the hell out of it, in fact, to the extent that it lumped the whole line after $status into the probe name. Most of the problems I experienced were actually caused by the '.' in my data report confusing Hobbit.
  • It'd be nice to be able to specify thresholds centrally on the server.
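Since the '.' in my data is what tripped up the parser, a small pre-processing step on the client helps. This Python sketch (the helper name sanitize_probe is hypothetical) strips whitespace and dots from a probe name and rounds a decimal reading to a whole number before it goes into the status message.

```python
import re


def sanitize_probe(name: str, reading: str) -> tuple[str, int]:
    """Turn an OMSA-style probe name and reading into parser-friendly tokens.

    Hypothetical helper: whitespace and dots are stripped from the name, and a
    decimal reading such as '30.6' is rounded to a whole number so that no '.'
    reaches the status message.
    """
    clean_name = re.sub(r"[\s.]+", "", name)
    celsius = int(round(float(reading)))
    return clean_name, celsius


if __name__ == "__main__":
    print(sanitize_probe("BMC Planar", "30.6"))
```

Here sanitize_probe("BMC Planar", "30.6") returns ("BMCPlanar", 31). Python's round() sends exact halves to the nearest even number ("17.5" becomes 18, "16.5" becomes 16), which is harmless for temperature readings.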
