miles wide miles deep: system monitor

Showing posts with label system monitor. Show all posts

Friday, December 22, 2006

how to coerce OMSA 5.1 to install & run properly on CentOS 4 (part III)

Here is the summary of what I did for OMSA 5.1 on CentOS 4.1 ( the white-box twin of RHAS 4.1)

For installation, on a CentOS 4.1 without OpenIPMI or net-snmp, you just need to

append /etc/redhat-release with "Nahant", RHEL 4's code name.

1c1
< CentOS release 4.1 (Final) Nahant --- > CentOS release 4.1 (Final)
2. start installation by running ./linux/suppportscripts/srvadmin-install.sh
After installation, answer NO to start start all services. Instead, conduct the following steps first:
1. insert a line 'test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0' at the beginning of /etc/init.d/dsm_sa_ipmi. w/o it, /etc/init.d/dsm_sa_ipmi will fail miserablly. It seems that udev doesn't put /dev/ipmi0 back as expected.
22,24d21 < # my hack < test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0 < #
2. insert '/etc/init.d/dsm_sa_ipmi start' at the beginning of 'start' section inside /usr/bin/srvadmin- services.sh
w/o it, srvadmin-services.sh start will fail with IPMI drivers fail to load. However, if you start IPMI manually by '/etc/init.d/dsm_sa_ipmi start', you'd be just fine.
241,243d240 < # my hack::ipmi failed to start when called from later this script < /etc/init.d/dsm_sa_ipmi start <
3. now start it all with /usr/bin/srvadmin-services.sh start
4. verify it with ' omreport chassis temps' and 'omreport storage controller '. These two rely on different things to work. I'll elaborate on this later.

The rc scripts are placed in the S50 instead of their own start sequence number mandated in the scripts. I corrected them manually with my understanding of chkconfig directives in the scripts. However, I didn't get to change run level to test whether that would guarantee OMSA startup succesfully, since all my boxens are in production. Besides, Hobbit Monitor will let me know soon enough upon such failure.

It is such a great relief to have constant & comprehensive monitoring against critical systems. As a system engineer, you don't have to make all these mental or paper notes to check this or that. A decent NMS will do its job to nag you (whether you like or not) when necessary.

Wednesday, December 20, 2006

how to change IP address on a Linux server :: a project?

I was asked this question once, "how do you change IP address on a Linux server?" At its face value, it's a simple technical question on systems administration or network administration using a Linux server. So, my instant response is

edit /etc/sysconfig/ifcfg-eth0 (or eth?). or system-config-network TUI or GUI.
adjust routes and personal fw on the host as needed
ifconfig eth1 down && ifconfig eth1 up (or /etc/init.d/network restart)
verify: ifconfig eth1 (to see if the new IP has taken effect)
check & verify from remote host it is actually accessible (TCP/IP level, plus other services the server may provide)
update DNS if applicable

To dwell on the question a little more, and depends on the context of the conversation, the lines of questions, and who asks, I would expand a bit from the perspective of process management, configuration management, and knowledge management. In other words, I'd prefer to drive such a "simple" change as a small project. Thus, here comes the addition to my initial answer:

So, the actual change is simple. Depends on the function of the server and its inter-relationship with other systems, you probably need to manage such a change as a project. As they say, a little bit more of thinking and planning goes a long way.

gather requirement from functional groups and management
set a date to cut-over and obtain sign-off from stake-holders
plan for it (procedures/fall-back plan/check-list/notification/etc.)
communicate the planned change and its potential impact to stake-holders and end users
dry run if necessary
prior to cut-over: for days leading up to the cut-over, shorten TTL for A record for that IP, if it has DNS record and is accessed by DNS name instead of IP alone.
prior to cut-over: check to be sure ACL or conduits or fw rules get updated on router/firewalls and systems the server gain access to or from
during cut-over: keep team and stake-holders abreast of up-to-date status
post cut-over: verify by going through your check-list to determine success
post cut-over: enter change log for this event & notify users the cut-over is done
post cut-over: alias or real NIC to keep servicing the IP for limited period. (could also just alias the new IP onto the NIC serving the current IP, the cut-over becomes switching who's alias and who's real)
post cut-over: added DNAT or reverse proxy rules to catch traffic to the old IP and redirect/log/alert as necessary
post cut-over: document this in the knowledge-management system: a scrapbook, a WIKI, a technical writer.

Monday, December 18, 2006

Hobbit Monitor :: how to report multiple temperture probe results

For Dell PowerEdge servers, OMSA can report temperature for multiple probes, notably, "BMC Ambient", "BMC Planaar", "BMC Riser", and one for each physical processor. I initially wrote my own extension script named 'temps' , which reports all probes under one 'status server1.temps green' call. It works for a test server after I did the following on the Hobbit server:

defined a RRD graph section in hobbitgraph.cfg for all probes available on that server
NCV_temps='*:GAUGE" in hobbitserver.cfg
appended temps=ncv to test2rrd line in hobbitserver.cfg
restart hobbitd server

When I deployed the same extension code to a different server, the limitation of such an extension became painfully obvious: the graph definition failed when a different server reports more or less probes. Since hobbitgraph.cfg's graph section is keyed to the test name in hobbitgraph.cfg, I can't add custom graph section per server configuration, not to mention it is not scalable!. As for alerting, I tried setting up thresholds on the server. However, it didn't seem to generate any alert even if the threshold is obviously surpassed. wonder if the 'status blah GREEN' reported by the extension script kinda blocked such centralized checking.

After posting the above questions to Hobbit's mailing list and got no answers, I took upon myself to find the answers. Good thing that Hobbit Monitor is licensed using GPLv1. I found the answers in the source code, rrd/do_temperature.c. [[ All hail goes to Open Source & Henrik! ]]

the 'temperature' test is built-in. duh...
if 'temperature' is used as 'test name' on the client, nothing needs to be changed on the server end.
a lump-all status command will do. Format of the data portion is somewhat restrictive. Each probe needs to be in the format of '&green BMCambient 17 62'.
the test can be done locally on the client and the overall $status is reported by 'status server1.temperature $status'.

Wishes:

It'd be nice to have such data report format documented elsewhere other than the source code.
The current code (4.2RC1-20060712) didn't take input other than integer too well. It confused the hell out of it, in fact, to the extent it lump the whole long after $status as the probe name. Most of problems I experienced was actually the '.' in my data report confused Hobbit.
It'd be nice to be able to specify threshold on the server centrally.

Wednesday, December 13, 2006

How to coerce OMSA 5.1 to install & run properly on CentOS 4 (part II)

Not so fast! The same fix doesn't work on CentOS 4.1 running on a Dell PE6850. srvadmin-install.sh somehow determined it needed a newer version of OpenIPMI from its own package, regardless of whether the OS has latest OpenIPMI installed or not. Recall that the same hack worked w/o OpenIPMI installed at all on CentOS 4.4.

# Here is the excerpt from srvadmin-install.sh
Server Administrator requires a newer version of the OpenIPMI driver modules
for the running kernel than are currently installed on the system. The OpenIPMI
driver modules for the running kernel will be upgraded by installing an
OpenIPMI driver RPM.
warning: RPMS/supportRPMS/openipmi-33.13.RHEL4-1dkms.noarch.rpm: V3 DSA signature: NOKEY, key ID 23b66a9d
Preparing... ########################################### [100%]
1:openipmi ########################################### [100%]

Loading kernel module source and prebuilt module binaries (if any)
Installing prebuilt kernel module binaries (if any)

Module build for the currently running kernel was skipped since the
kernel source for this kernel does not seem to be installed.

Your DKMS tree now includes:
openipmi, 33.13.RHEL4, 2.6.9-5.EL, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.EL, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELhugemem, i686: built

The OpenIPMI packages is from the CentOS 4.4 release DVD. So, it couldn't be newer !
# rpm -aq|grep -i OpenIPMI
OpenIPMI-libs-1.4.14-1.4E.13
OpenIPMI-1.4.14-1.4E.13

All our production servers don't have kernel source or development kits available. So, dkms approach isn't the one we'd take easily if we have other viable alternatives. So, I modified srvadmin-openipmi.sh to force it not to upgrade IPMI.
# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
1055c1055,1056
< loc_recommended_action="$?
> #LOC_RECOMMENDED_ACTION=$?
> LOC_RECOMMENDED_ACTION=4

This worked for the installation. However, upon start-up the services, the IPMI services failed to start. I ran into problem before with OMSA's own ipmi init script (dsm_sa_ipmi) deleting /dev/ipmi0 while udev never puts it back. So, as a simple hack, I just added a line to add the device at the beginning of the script.
test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0

After this, /etc/init.d/dsm_sa_ipmi would start ok, w/o OpenIPMI package installed. However, when it is called from srvadmin-services.sh (I suspect instsvcdrv is the actually caller). It failed each every time, causing omreport could't get any data. For now, I can manually start dsm_sa_ipmi and then start/stop srvadmin-services.sh just fine.

What makes worse (maybe even the cause of this IPMI start-up problem) is that the init sequence numbers assigned to all these init scripts are S50, instead of S00(mptctl), S96(instsvcdrv), and S97 for all the rest. I don't think this will work upon reboot or run level changes. I fixed it manually per chkconfig directives inside the scripts. However, I am curious why chkconfig got confused.

P.S. A more complete tweak would be at beginning of the function in srvadmin-openipmi.sh. The previous tweak on srvadmin-openipmi.sh works only when you have OpenIPMI installed.

# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
973a974,976
> LOC_RECOMMENDED_ACTION=4
> return ${LOC_RECOMMENDED_ACTION}

How to coerce OMSA 5.1 to install & run properly on CentOS 4

To monitor Dell PowerEdge servers running Linux, I downloaded OMSA (Open Manage Server Assistant) 5.1 package for Redhat Enterprise Linux 4 (RHEL4) on a bunch of Dell PowerEdge 6850 servers. The package was listed under 'system management' of the 'drivers & download' section. The actual file name is 'OM_5.1_ManNode_LIN_A00.tar.gz'.

On the PE6850 server, I unpacked the tarball, and attempted to install using the srvadmin-install.sh. It failed silently, with an error code of 2.
# supportscripts/srvadmin-install.sh
# echo $?
2
Examing srvadmin-install.sh closely, I realized that it didn't support CentOS 4. Instead, it supports & detects RHEL4 by its code node 'Nahant'. Instead of mucking the code to support CentOS, I decided to fake RHEL4 by modifying /etc/redhat-release.
# cat /etc/redhat-release
fake RHEL 4 Nahant
It worked.The installation went a-ok and all services started properly. 'omreport storage controllers' now reports various information on various storage controllers.

Thinking the faking was needed only by the installation scripts, I reverted /etc/redhat-release to its original version. To be sure, I restarted OMSA by ' srvadmin-services.sh stop|start'. Well, it didn't pan out that well.

# /usr/bin/srvadmin-services.sh start
Starting mptctl:
Waiting for mptctl driver registration to complete:
[ OK ] Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [FAILED]
Starting Systems Management Device Drivers:
Starting dell_rbu: Already started [ OK ]
Starting ipmi driver: [FAILED]
Starting DSM SA Shared Services: [ OK ]

Starting DSM SA Connection Service: [ OK ]

Now it seemed the srvadmin-ipmi package's init script still relies on /etc/redhat-release to tell it which OS it is running on. The server doesn't have OpenIPMI installed. Without any IPMI driver loaded, no report can be run against storage or temps using OMSA.

To have it work for both world (OMSA and CentOS), I cooked up this /etc/redhat-release by appending the RHEL4 code name to the original /etc/redhat-release
# cat /etc/redhat-release.fake
CentOS release 4.4 (Final) Nahant
# /usr/bin/srvadmin-services.sh start
Starting mptctl:
Waiting for mptctl driver registration to complete: [ OK ] Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [ OK ]
Starting Systems Management Data Engine:
Starting dsm_sa_datamgr32d: [ OK ]
Starting dsm_sa_eventmgr32d: [ OK ]
Starting DSM SA Shared Services: [ OK ]
Starting DSM SA Connection Service: [ OK ]

This cooked /etc/redhat-release had been verified to work well with 'yum -y update' and OMSA's 'srvadmin-services.sh start'

Wednesday, August 09, 2006

inadvertent load test against Sybase ASE --- what did a Hobbit find

Last few weeks, I have been busy building up NMS for our site using Hobbit Monitor, an open-source NMS suite mimicking Bigbrother. For the most part, Hobbit Monitor (4.2-RC-20060712 w/o all-in-one patch) works out of the box for non-SELINUX server. I did my SELinux tweaks so it works on CentOS 4 w/o disabling SELinux. My job this week is mainly adding/customizing various health-checks, with some trials & errors here and there.
From time to time, for production database server running Sybase ASE 12.5.x on CentOS 4/i386, a few checks complained no data reported by Hobbit client. After a few days, I discerned the pattern. That is, [ports] and NMS servers's [hobbitd] are always bad when [files] [msgs] [procs] went CLEAR (no data, but considered OK by design) or PURPLE (non-reporting). Hobbitd's history page has 'client data' shows various truncation message. A post to the mailing list got answer from the author Henrik himself. I bumped up the size limit on client data/message from the default 256K to 1024K. Lo-and-behold, there are over 9000 TCP TIME_WAIT connections (~= new db connection) to the database engines.
Since TIME_WAIT is 2 x MSL (Maximum Segment lifetime), it means the application servers made over 9000 connections to the database server within a four-minute sliding window. Talking with our application architect turns fruitful. He said he has been contemplating to use pooled connection for some auxiliary processes. The main web applcation counts for only 20~40 connections (TCP ESTABLISHED) to the database since they are pooled. Recently in our performance lab we are testing how scalable Sybase ASE is for our web application. Now we sorta know. It can handle >= 2125 new database connections per minute, or 3 millon a day :)

Saturday, August 05, 2006

Hobbit monitor :: a nice BigBrother replacement

it took me a bit to hunt down the good community of Hobbits, and it is well worthy. Hobbit Monitor's sourceforge.net site is not kept up-2-date. Wonder whether the wrong impression people get off the SF site misled people to believe it is not maintained up-to-date. The Hobbit Monitor seems to be nice & functional Big Brother replacement. The active development and the author responds to mailing-list question and code/attach patch while discussing a bug with an end-user, is reminiscent of the early open-source days or early Internet days. I definitely like the ACK feature and the zoom-in on history and trend graph (guess it'd be part of the rrdtool function as well).