miles wide miles deep: December 2006

Friday, December 22, 2006

how to coerce OMSA 5.1 to install & run properly on CentOS 4 (part III)

Here is the summary of what I did for OMSA 5.1 on CentOS 4.1 ( the white-box twin of RHAS 4.1)

For installation, on a CentOS 4.1 without OpenIPMI or net-snmp, you just need to

append /etc/redhat-release with "Nahant", RHEL 4's code name.

1c1
< CentOS release 4.1 (Final) Nahant --- > CentOS release 4.1 (Final)
2. start installation by running ./linux/suppportscripts/srvadmin-install.sh
After installation, answer NO to start start all services. Instead, conduct the following steps first:
1. insert a line 'test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0' at the beginning of /etc/init.d/dsm_sa_ipmi. w/o it, /etc/init.d/dsm_sa_ipmi will fail miserablly. It seems that udev doesn't put /dev/ipmi0 back as expected.
22,24d21 < # my hack < test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0 < #
2. insert '/etc/init.d/dsm_sa_ipmi start' at the beginning of 'start' section inside /usr/bin/srvadmin- services.sh
w/o it, srvadmin-services.sh start will fail with IPMI drivers fail to load. However, if you start IPMI manually by '/etc/init.d/dsm_sa_ipmi start', you'd be just fine.
241,243d240 < # my hack::ipmi failed to start when called from later this script < /etc/init.d/dsm_sa_ipmi start <
3. now start it all with /usr/bin/srvadmin-services.sh start
4. verify it with ' omreport chassis temps' and 'omreport storage controller '. These two rely on different things to work. I'll elaborate on this later.

The rc scripts are placed in the S50 instead of their own start sequence number mandated in the scripts. I corrected them manually with my understanding of chkconfig directives in the scripts. However, I didn't get to change run level to test whether that would guarantee OMSA startup succesfully, since all my boxens are in production. Besides, Hobbit Monitor will let me know soon enough upon such failure.

It is such a great relief to have constant & comprehensive monitoring against critical systems. As a system engineer, you don't have to make all these mental or paper notes to check this or that. A decent NMS will do its job to nag you (whether you like or not) when necessary.

Wednesday, December 20, 2006

how to change IP address on a Linux server :: a project?

I was asked this question once, "how do you change IP address on a Linux server?" At its face value, it's a simple technical question on systems administration or network administration using a Linux server. So, my instant response is

edit /etc/sysconfig/ifcfg-eth0 (or eth?). or system-config-network TUI or GUI.
adjust routes and personal fw on the host as needed
ifconfig eth1 down && ifconfig eth1 up (or /etc/init.d/network restart)
verify: ifconfig eth1 (to see if the new IP has taken effect)
check & verify from remote host it is actually accessible (TCP/IP level, plus other services the server may provide)
update DNS if applicable

To dwell on the question a little more, and depends on the context of the conversation, the lines of questions, and who asks, I would expand a bit from the perspective of process management, configuration management, and knowledge management. In other words, I'd prefer to drive such a "simple" change as a small project. Thus, here comes the addition to my initial answer:

So, the actual change is simple. Depends on the function of the server and its inter-relationship with other systems, you probably need to manage such a change as a project. As they say, a little bit more of thinking and planning goes a long way.

gather requirement from functional groups and management
set a date to cut-over and obtain sign-off from stake-holders
plan for it (procedures/fall-back plan/check-list/notification/etc.)
communicate the planned change and its potential impact to stake-holders and end users
dry run if necessary
prior to cut-over: for days leading up to the cut-over, shorten TTL for A record for that IP, if it has DNS record and is accessed by DNS name instead of IP alone.
prior to cut-over: check to be sure ACL or conduits or fw rules get updated on router/firewalls and systems the server gain access to or from
during cut-over: keep team and stake-holders abreast of up-to-date status
post cut-over: verify by going through your check-list to determine success
post cut-over: enter change log for this event & notify users the cut-over is done
post cut-over: alias or real NIC to keep servicing the IP for limited period. (could also just alias the new IP onto the NIC serving the current IP, the cut-over becomes switching who's alias and who's real)
post cut-over: added DNAT or reverse proxy rules to catch traffic to the old IP and redirect/log/alert as necessary
post cut-over: document this in the knowledge-management system: a scrapbook, a WIKI, a technical writer.

Monday, December 18, 2006

Hobbit Monitor :: how to report multiple temperture probe results

For Dell PowerEdge servers, OMSA can report temperature for multiple probes, notably, "BMC Ambient", "BMC Planaar", "BMC Riser", and one for each physical processor. I initially wrote my own extension script named 'temps' , which reports all probes under one 'status server1.temps green' call. It works for a test server after I did the following on the Hobbit server:

defined a RRD graph section in hobbitgraph.cfg for all probes available on that server
NCV_temps='*:GAUGE" in hobbitserver.cfg
appended temps=ncv to test2rrd line in hobbitserver.cfg
restart hobbitd server

When I deployed the same extension code to a different server, the limitation of such an extension became painfully obvious: the graph definition failed when a different server reports more or less probes. Since hobbitgraph.cfg's graph section is keyed to the test name in hobbitgraph.cfg, I can't add custom graph section per server configuration, not to mention it is not scalable!. As for alerting, I tried setting up thresholds on the server. However, it didn't seem to generate any alert even if the threshold is obviously surpassed. wonder if the 'status blah GREEN' reported by the extension script kinda blocked such centralized checking.

After posting the above questions to Hobbit's mailing list and got no answers, I took upon myself to find the answers. Good thing that Hobbit Monitor is licensed using GPLv1. I found the answers in the source code, rrd/do_temperature.c. [[ All hail goes to Open Source & Henrik! ]]

the 'temperature' test is built-in. duh...
if 'temperature' is used as 'test name' on the client, nothing needs to be changed on the server end.
a lump-all status command will do. Format of the data portion is somewhat restrictive. Each probe needs to be in the format of '&green BMCambient 17 62'.
the test can be done locally on the client and the overall $status is reported by 'status server1.temperature $status'.

Wishes:

It'd be nice to have such data report format documented elsewhere other than the source code.
The current code (4.2RC1-20060712) didn't take input other than integer too well. It confused the hell out of it, in fact, to the extent it lump the whole long after $status as the probe name. Most of problems I experienced was actually the '.' in my data report confused Hobbit.
It'd be nice to be able to specify threshold on the server centrally.

Wednesday, December 13, 2006

How to coerce OMSA 5.1 to install & run properly on CentOS 4 (part II)

Not so fast! The same fix doesn't work on CentOS 4.1 running on a Dell PE6850. srvadmin-install.sh somehow determined it needed a newer version of OpenIPMI from its own package, regardless of whether the OS has latest OpenIPMI installed or not. Recall that the same hack worked w/o OpenIPMI installed at all on CentOS 4.4.

# Here is the excerpt from srvadmin-install.sh
Server Administrator requires a newer version of the OpenIPMI driver modules
for the running kernel than are currently installed on the system. The OpenIPMI
driver modules for the running kernel will be upgraded by installing an
OpenIPMI driver RPM.
warning: RPMS/supportRPMS/openipmi-33.13.RHEL4-1dkms.noarch.rpm: V3 DSA signature: NOKEY, key ID 23b66a9d
Preparing... ########################################### [100%]
1:openipmi ########################################### [100%]

Loading kernel module source and prebuilt module binaries (if any)
Installing prebuilt kernel module binaries (if any)

Module build for the currently running kernel was skipped since the
kernel source for this kernel does not seem to be installed.

Your DKMS tree now includes:
openipmi, 33.13.RHEL4, 2.6.9-5.EL, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.EL, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELhugemem, i686: built

The OpenIPMI packages is from the CentOS 4.4 release DVD. So, it couldn't be newer !
# rpm -aq|grep -i OpenIPMI
OpenIPMI-libs-1.4.14-1.4E.13
OpenIPMI-1.4.14-1.4E.13

All our production servers don't have kernel source or development kits available. So, dkms approach isn't the one we'd take easily if we have other viable alternatives. So, I modified srvadmin-openipmi.sh to force it not to upgrade IPMI.
# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
1055c1055,1056
< loc_recommended_action="$?
> #LOC_RECOMMENDED_ACTION=$?
> LOC_RECOMMENDED_ACTION=4

This worked for the installation. However, upon start-up the services, the IPMI services failed to start. I ran into problem before with OMSA's own ipmi init script (dsm_sa_ipmi) deleting /dev/ipmi0 while udev never puts it back. So, as a simple hack, I just added a line to add the device at the beginning of the script.
test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0

After this, /etc/init.d/dsm_sa_ipmi would start ok, w/o OpenIPMI package installed. However, when it is called from srvadmin-services.sh (I suspect instsvcdrv is the actually caller). It failed each every time, causing omreport could't get any data. For now, I can manually start dsm_sa_ipmi and then start/stop srvadmin-services.sh just fine.

What makes worse (maybe even the cause of this IPMI start-up problem) is that the init sequence numbers assigned to all these init scripts are S50, instead of S00(mptctl), S96(instsvcdrv), and S97 for all the rest. I don't think this will work upon reboot or run level changes. I fixed it manually per chkconfig directives inside the scripts. However, I am curious why chkconfig got confused.

P.S. A more complete tweak would be at beginning of the function in srvadmin-openipmi.sh. The previous tweak on srvadmin-openipmi.sh works only when you have OpenIPMI installed.

# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
973a974,976
> LOC_RECOMMENDED_ACTION=4
> return ${LOC_RECOMMENDED_ACTION}

How to coerce OMSA 5.1 to install & run properly on CentOS 4

To monitor Dell PowerEdge servers running Linux, I downloaded OMSA (Open Manage Server Assistant) 5.1 package for Redhat Enterprise Linux 4 (RHEL4) on a bunch of Dell PowerEdge 6850 servers. The package was listed under 'system management' of the 'drivers & download' section. The actual file name is 'OM_5.1_ManNode_LIN_A00.tar.gz'.

On the PE6850 server, I unpacked the tarball, and attempted to install using the srvadmin-install.sh. It failed silently, with an error code of 2.
# supportscripts/srvadmin-install.sh
# echo $?
2
Examing srvadmin-install.sh closely, I realized that it didn't support CentOS 4. Instead, it supports & detects RHEL4 by its code node 'Nahant'. Instead of mucking the code to support CentOS, I decided to fake RHEL4 by modifying /etc/redhat-release.
# cat /etc/redhat-release
fake RHEL 4 Nahant
It worked.The installation went a-ok and all services started properly. 'omreport storage controllers' now reports various information on various storage controllers.

Thinking the faking was needed only by the installation scripts, I reverted /etc/redhat-release to its original version. To be sure, I restarted OMSA by ' srvadmin-services.sh stop|start'. Well, it didn't pan out that well.

# /usr/bin/srvadmin-services.sh start
Starting mptctl:
Waiting for mptctl driver registration to complete:
[ OK ] Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [FAILED]
Starting Systems Management Device Drivers:
Starting dell_rbu: Already started [ OK ]
Starting ipmi driver: [FAILED]
Starting DSM SA Shared Services: [ OK ]

Starting DSM SA Connection Service: [ OK ]

Now it seemed the srvadmin-ipmi package's init script still relies on /etc/redhat-release to tell it which OS it is running on. The server doesn't have OpenIPMI installed. Without any IPMI driver loaded, no report can be run against storage or temps using OMSA.

To have it work for both world (OMSA and CentOS), I cooked up this /etc/redhat-release by appending the RHEL4 code name to the original /etc/redhat-release
# cat /etc/redhat-release.fake
CentOS release 4.4 (Final) Nahant
# /usr/bin/srvadmin-services.sh start
Starting mptctl:
Waiting for mptctl driver registration to complete: [ OK ] Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [ OK ]
Starting Systems Management Data Engine:
Starting dsm_sa_datamgr32d: [ OK ]
Starting dsm_sa_eventmgr32d: [ OK ]
Starting DSM SA Shared Services: [ OK ]
Starting DSM SA Connection Service: [ OK ]

This cooked /etc/redhat-release had been verified to work well with 'yum -y update' and OMSA's 'srvadmin-services.sh start'

Wednesday, December 06, 2006

trick to enable non-bridged wireless access on a Netgear Wireless router (WGR614v6)

I bought a Netgear wireless router (WGR614v6) from CompUSA yesterday. It is going to a customer's site for WI-FI blackberries (BB7270). I need a wireless router that can simply

obtain its own TCP/IP via DHCP from a network port
self-contain its WLAN: NAT/SPI protect all wi-fi devices

I would be happier with a Netgear MR814v3, which we have in the lab and is known to work well with BB7270. All the bugs and glitches we ran into in the past year turned out to be on RIM's end, instead of the "poor" quality perceived by RIM's engineer of consumer-grade wireless access points.

Followed the auto-run wizard from the installation CD to the letter, I connected my own laptop to one of the LAN ports on the unit, with the unit's own Internet port plugged into the wall jacket in my office. Once I did a 'ipconfig /renew' on my laptop, the laptop got TCP/IP settings via DHCP as promised. It is a bit odd to see the DNS server is 192.168.1.1. This means the Netgear unit proxies DNS queries for devices utilizing its DHCP/NAT services. However, two things do not work right:

nslookup anything resolves to 192.168.1.1 ?
the LED on the front of the unit is not on for the WI-FI? 'Router status' says 'wireless :: off', while the checkbox to turn on 'wireless router radio' was shadowed out!

Searching on Netgear's site, I was relieved/disappointed to find the firmware was up to date. There's a link from WGR614's support page on how to enable the unit as a wireless access point. However, such enabling is to bridge the WLAN to the local LAN. I attempted that to satisfy my curiosity. It requires to use a regular LAN port as the autosense uplink instead of the separate Internet port on the unit. Once I did it, the laptop and a wifi blackberry were bridged to the real Ethernet LAN and got their TCP/IP settings from the real corporate DHCP server instead of from the unit's own DHCP server.

Lo & behold, everything went peachy all of a sudden, after this diversion. The checkbox to turn on wireless router radio option is now available, even if after I moved the cable from regular LAN port to Internet port on the unit. The wireless icon on the unit is blinking green. Power cycle the unit didn't eliminate such capability.

I am a bit at loss why wireless capability was disabled before and how it could be enabled by the steps I took. One thing I should have done at the beginning, is to reset it to factory default, just in case someone has messed it up somehow. The unit has all its seals and should be a new one. Anyhow, I am glad it worked out. However, I can't imagine any non-techie can get this far w/o picking up the phone or returning it to the retailer.

One thing to note is that WPA-PSK worked as advertised, both on the access point and on the BB7270. A lengthy 'mydoaminDS-24random' was used for SSID and the maximum allowed '63RANDOM' as the PSK (Pre-shared key). I couldn't imagine that I can type them in w/o a typo, so I emailed such lengthy secret to the email account currently assigned to my test BB7270. On the BB7270, I copy+paste these lengthy random bits from email to a new WLAN profile creation form under Options/WLAN. It worked great. With older version of BB7270 firmware, copy+paste WEP key fails each every time. I'd assume the bug was fixed by now (from v4.0.1.84 to v.4.0.1.104). On the BES (Blackberry Enterprise Solution) server, a site-specific "IT Policy" got created on the using these WLAN secrets worked great too. However, the SSID somehow took several kicks to get really sucked into the "IT policy". The BBM (Blackberry Manager) didn't complain when SSID didn't take, resulting that a BB7270 provisioned from scratch after nuking didn't carry any WLAN profile at all. Only upon reviewing the IT policy, I was shocked to find SSID field was blank while everything else under the WLAN policy section looked good.

miles wide miles deep