
Friday, December 22, 2006

How to coerce OMSA 5.1 to install & run properly on CentOS 4 (part III)

Here is a summary of what I did for OMSA 5.1 on CentOS 4.1 (the white-box twin of RHEL AS 4.1).

For installation on a CentOS 4.1 box without OpenIPMI or net-snmp, you just need to
  1. append "Nahant" (RHEL 4's code name) to /etc/redhat-release:
1c1
< CentOS release 4.1 (Final) Nahant
---
> CentOS release 4.1 (Final)
  2. start the installation by running ./linux/supportscripts/srvadmin-install.sh
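If you prefer to script those two steps, here is a minimal sketch (run from the directory where the OMSA tarball was unpacked; it keeps a pristine copy of the release file around):
# cp -p /etc/redhat-release /etc/redhat-release.orig
# sed -i 's/$/ Nahant/' /etc/redhat-release
# ./linux/supportscripts/srvadmin-install.sh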
After installation, answer NO when asked to start all services. Instead, conduct the following steps first:
1. insert the line 'test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0' at the beginning of /etc/init.d/dsm_sa_ipmi. w/o it, /etc/init.d/dsm_sa_ipmi will fail miserably. It seems that udev doesn't put /dev/ipmi0 back as expected.
22,24d21
< # my hack
< test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0
< #
2. insert '/etc/init.d/dsm_sa_ipmi start' at the beginning of the 'start' section inside /usr/bin/srvadmin-services.sh
w/o it, 'srvadmin-services.sh start' will fail because the IPMI drivers fail to load. However, if you start IPMI manually with '/etc/init.d/dsm_sa_ipmi start' first, you'd be just fine.
241,243d240
< # my hack :: ipmi failed to start when called later from this script
< /etc/init.d/dsm_sa_ipmi start
<
3. now start it all with /usr/bin/srvadmin-services.sh start
4. verify it with 'omreport chassis temps' and 'omreport storage controller'. These two rely on different things to work. I'll elaborate on this later.
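If you'd rather not edit Dell's scripts at all, the manual equivalent of the whole sequence boils down to:
# test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0
# /etc/init.d/dsm_sa_ipmi start
# /usr/bin/srvadmin-services.sh start
# omreport chassis temps
# omreport storage controller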

The rc scripts were all placed at S50 instead of the start sequence numbers mandated by the chkconfig directives inside the scripts. I corrected them manually based on my understanding of those directives. However, I didn't get to change run levels to test whether that would guarantee OMSA starts up successfully, since all my boxen are in production. Besides, Hobbit Monitor will let me know soon enough upon such a failure.
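If you'd rather not fix the rc symlinks by hand, one way (a sketch I haven't rebooted to verify) is to make chkconfig re-read the directives embedded in the scripts:
# chkconfig --del dsm_sa_ipmi && chkconfig --add dsm_sa_ipmi
# chkconfig --del instsvcdrv && chkconfig --add instsvcdrv
# chkconfig --list | grep -E 'dsm_|instsvcdrv|mptctl'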

It is such a great relief to have constant & comprehensive monitoring of critical systems. As a system engineer, you don't have to keep all these mental or paper notes to check this or that. A decent NMS will do its job and nag you (whether you like it or not) when necessary.

Wednesday, December 20, 2006

how to change IP address on a Linux server :: a project?

I was asked this question once: "how do you change the IP address on a Linux server?" At face value, it's a simple technical question about systems or network administration on a Linux server. So, my instant response is (with a command-level sketch after the list):
  • edit /etc/sysconfig/network-scripts/ifcfg-eth0 (or eth?), or use the system-config-network TUI or GUI
  • adjust routes and personal fw on the host as needed
  • ifdown eth0 && ifup eth0 (or /etc/init.d/network restart) so the new settings are re-read from the ifcfg file
  • verify: ifconfig eth0 (to see if the new IP has taken effect)
  • check & verify from a remote host that it is actually accessible (at the TCP/IP level, plus whatever other services the server provides)
  • update DNS if applicable
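At the command level, the core of the change could look like this sketch (the interface name, the 192.0.2.x addresses, and the gateway file location are illustrative assumptions):
vi /etc/sysconfig/network-scripts/ifcfg-eth0   # set IPADDR=192.0.2.20 and NETMASK=255.255.255.0
vi /etc/sysconfig/network                      # update GATEWAY= if the default gateway moves too
ifdown eth0 && ifup eth0                       # re-read the config (or /etc/init.d/network restart)
ifconfig eth0                                  # confirm the new address took effect
ping -c 3 192.0.2.1                            # confirm the gateway answers from the new address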
To dwell on the question a little more, and depending on the context of the conversation, the line of questioning, and who asks, I would expand a bit from the perspective of process management, configuration management, and knowledge management. In other words, I'd prefer to drive such a "simple" change as a small project. Thus, here comes the addition to my initial answer:

So, the actual change is simple. Depending on the function of the server and its inter-relationships with other systems, you probably need to manage such a change as a project. As they say, a little more thinking and planning goes a long way.
  • gather requirements from functional groups and management
  • set a cut-over date and obtain sign-off from stakeholders
  • plan for it (procedures/fall-back plan/check-list/notification/etc.)
  • communicate the planned change and its potential impact to stakeholders and end users
  • dry run if necessary
  • prior to cut-over: for the days leading up to the cut-over, shorten the TTL on the A record for that IP, if it has a DNS record and is accessed by name instead of by IP alone
  • prior to cut-over: check to be sure ACLs, conduits, or firewall rules get updated on routers/firewalls and on the systems the server needs access to or from
  • during cut-over: keep the team and stakeholders abreast of the latest status
  • post cut-over: verify by going through your check-list to determine success
  • post cut-over: enter a change log for this event & notify users that the cut-over is done
  • post cut-over: keep an alias or a real NIC servicing the old IP for a limited period (you could also just alias the new IP onto the NIC serving the current IP, so the cut-over becomes switching which one is the alias and which is real)
  • post cut-over: add DNAT or reverse-proxy rules to catch traffic to the old IP and redirect/log/alert as necessary (see the sketch after this list)
  • post cut-over: document this in the knowledge-management system: a scrapbook, a wiki, a technical writer.
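For the alias and DNAT items above, a rough sketch (192.0.2.10 stands in for the old IP and 192.0.2.20 for the new one; both are illustrative):
# option 1: keep the old IP alive as an alias on the server's NIC for a grace period
ifconfig eth0:0 192.0.2.10 netmask 255.255.255.0
# option 2: on a gateway or firewall in the path, log and redirect stragglers to the new IP
iptables -t nat -A PREROUTING -d 192.0.2.10 -j LOG --log-prefix "old-IP hit: "
iptables -t nat -A PREROUTING -d 192.0.2.10 -j DNAT --to-destination 192.0.2.20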

Monday, December 18, 2006

Hobbit Monitor :: how to report multiple temperature probe results

For Dell PowerEdge servers, OMSA can report temperature for multiple probes, notably "BMC Ambient", "BMC Planar", "BMC Riser", and one for each physical processor. I initially wrote my own extension script named 'temps', which reports all probes under one 'status server1.temps green' call. It worked for a test server after I did the following on the Hobbit server:
  • defined an RRD graph section in hobbitgraph.cfg for all the probes available on that server (see the sketch after this list)
  • set NCV_temps="*:GAUGE" in hobbitserver.cfg
  • appended temps=ncv to the TEST2RRD line in hobbitserver.cfg
  • restarted the hobbitd server
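For reference, the server-side pieces looked roughly like this (a sketch from memory; the probe/DS names and colors are illustrative and have to match whatever the extension reports, which is exactly the per-server limitation described next):
# hobbitserver.cfg: append ",temps=ncv" to the existing TEST2RRD= list, and add:
NCV_temps="*:GAUGE"

# hobbitgraph.cfg: one DEF/LINE pair per probe, so every server must report the same set
[temps]
        TITLE Temperature probes
        YAXIS Degrees
        DEF:amb=temps.rrd:BMCAmbient:AVERAGE
        LINE2:amb#00CC00:BMC Ambient
        DEF:pla=temps.rrd:BMCPlanar:AVERAGE
        LINE2:pla#0000FF:BMC Planar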
When I deployed the same extension code to a different server, the limitation of such an extension became painfully obvious: the graph definition failed when a different server reported more or fewer probes. Since hobbitgraph.cfg's graph section is keyed to the test name, I can't add a custom graph section per server, not to mention it doesn't scale! As for alerting, I tried setting up thresholds on the server, but it didn't seem to generate any alert even when the threshold was obviously surpassed. I wonder if the 'status blah green' reported by the extension script kinda blocked such centralized checking.

After posting the above questions to Hobbit's mailing list and getting no answers, I took it upon myself to find them. Good thing that Hobbit Monitor is licensed under the GPL. I found the answers in the source code, rrd/do_temperature.c. [[ All hail Open Source & Henrik! ]]
  • the 'temperature' test is built-in. duh...
  • if 'temperature' is used as the test name on the client, nothing needs to be changed on the server end.
  • a lump-all status command will do. The format of the data portion is somewhat restrictive: each probe needs to be on its own line in the form '&green BMCambient 17 62'.
  • the evaluation can be done locally on the client and the overall $status is reported via 'status server1.temperature $status' (a minimal client-side sketch follows this list).
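Here is a minimal client-side sketch along those lines, assuming the standard Hobbit client environment ($BB, $BBDISP, $MACHINE) set up by hobbitlaunch; the probe lines are hard-coded for illustration, whereas in practice they would be assembled from 'omreport chassis temps' with the readings rounded to integers (see the wishes below):
#!/bin/sh
# report every probe under the built-in 'temperature' test in one status message
COLOR=green
DATA="&green BMCambient 17 62
&green BMCplanar 25 60"
$BB $BBDISP "status $MACHINE.temperature $COLOR `date`
${DATA}"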
Wishes:
  • It'd be nice to have the data report format documented somewhere other than the source code.
  • The current code (4.2RC1-20060712) doesn't handle non-integer input too well. It gets thoroughly confused, in fact, to the extent that it lumps the whole line after the status color in as the probe name. Most of the problems I experienced came down to the '.' in my data report confusing Hobbit.
  • It'd be nice to be able to specify thresholds centrally on the server.

Wednesday, December 13, 2006

How to coerce OMSA 5.1 to install & run properly on CentOS 4 (part II)

Not so fast! The same fix doesn't work on CentOS 4.1 running on a Dell PE6850. srvadmin-install.sh somehow determined it needed a newer version of OpenIPMI from its own packages, regardless of whether the OS has the latest OpenIPMI installed or not. Recall that the same hack worked w/o OpenIPMI installed at all on CentOS 4.4.

# Here is an excerpt of the output from srvadmin-install.sh
Server Administrator requires a newer version of the OpenIPMI driver modules
for the running kernel than are currently installed on the system. The OpenIPMI
driver modules for the running kernel will be upgraded by installing an
OpenIPMI driver RPM.
warning: RPMS/supportRPMS/openipmi-33.13.RHEL4-1dkms.noarch.rpm: V3 DSA signature: NOKEY, key ID 23b66a9d
Preparing... ########################################### [100%]
1:openipmi ########################################### [100%]

Loading kernel module source and prebuilt module binaries (if any)
Installing prebuilt kernel module binaries (if any)

Module build for the currently running kernel was skipped since the
kernel source for this kernel does not seem to be installed.

Your DKMS tree now includes:
openipmi, 33.13.RHEL4, 2.6.9-5.EL, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.EL, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELhugemem, i686: built

The OpenIPMI packages are from the CentOS 4.4 release DVD, so they couldn't be any newer!
# rpm -aq|grep -i OpenIPMI
OpenIPMI-libs-1.4.14-1.4E.13
OpenIPMI-1.4.14-1.4E.13

None of our production servers have kernel source or development packages installed, so the DKMS approach isn't one we'd take readily if there are other viable alternatives. So, I modified srvadmin-openipmi.sh to force it not to upgrade IPMI.
# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
1055c1055,1056
< LOC_RECOMMENDED_ACTION=$?
---
> #LOC_RECOMMENDED_ACTION=$?
> LOC_RECOMMENDED_ACTION=4

This worked for the installation. However, upon starting the services, the IPMI service failed to start. I ran into a problem before with OMSA's own ipmi init script (dsm_sa_ipmi) deleting /dev/ipmi0 while udev never put it back. So, as a simple hack, I just added a line at the beginning of the script to recreate the device:
test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0

After this, /etc/init.d/dsm_sa_ipmi would start OK, w/o the OpenIPMI package installed. However, when it is called from srvadmin-services.sh (I suspect instsvcdrv is the actual caller), it failed every time, so omreport couldn't get any data. For now, I can manually start dsm_sa_ipmi and then start/stop srvadmin-services.sh just fine.

What makes it worse (maybe even the cause of this IPMI start-up problem) is that the init sequence numbers assigned to all these init scripts are S50, instead of S00 (mptctl), S96 (instsvcdrv), and S97 for all the rest. I don't think this will work upon reboot or run-level changes. I fixed it manually per the chkconfig directives inside the scripts. However, I am curious why chkconfig got confused.

P.S. A more complete tweak would be at the beginning of the function in srvadmin-openipmi.sh; the previous tweak works only when you already have OpenIPMI installed.

# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
973a974,976
> LOC_RECOMMENDED_ACTION=4
> return ${LOC_RECOMMENDED_ACTION}

How to coerce OMSA 5.1 to install & run properly on CentOS 4

To monitor Dell PowerEdge servers running Linux, I downloaded the OMSA (OpenManage Server Administrator) 5.1 package for Red Hat Enterprise Linux 4 (RHEL4) for a bunch of Dell PowerEdge 6850 servers. The package was listed under 'system management' in the 'drivers & downloads' section. The actual file name is 'OM_5.1_ManNode_LIN_A00.tar.gz'.

On the PE6850 server, I unpacked the tarball and attempted to install using srvadmin-install.sh. It failed silently, with an exit code of 2.
# supportscripts/srvadmin-install.sh
# echo $?
2
Examining srvadmin-install.sh closely, I realized that it doesn't support CentOS 4. Instead, it supports & detects RHEL4 by its code name 'Nahant'. Rather than mucking with the code to support CentOS, I decided to fake RHEL4 by modifying /etc/redhat-release.
# cat /etc/redhat-release
fake RHEL 4 Nahant
It worked. The installation went a-OK and all services started properly. 'omreport storage controller' now reports all sorts of information about the storage controllers.

Thinking the faking was needed only by the installation scripts, I reverted /etc/redhat-release to its original version. To be sure, I restarted OMSA with 'srvadmin-services.sh stop|start'. Well, it didn't pan out that well.

# /usr/bin/srvadmin-services.sh start

Starting mptctl:
Waiting for mptctl driver registration to complete: [ OK ]
Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [FAILED]
Starting Systems Management Device Drivers:
Starting dell_rbu: Already started [ OK ]
Starting ipmi driver: [FAILED]
Starting DSM SA Shared Services: [ OK ]

Starting DSM SA Connection Service: [ OK ]

Now it seemed the srvadmin-ipmi package's init script still relies on /etc/redhat-release to tell it which OS it is running on. The server doesn't have OpenIPMI installed. Without any IPMI driver loaded, no report can be run against storage or temps using OMSA.

To have it work for both worlds (OMSA and CentOS), I cooked up a new /etc/redhat-release by appending the RHEL4 code name to the original:
# cat /etc/redhat-release.fake
CentOS release 4.4 (Final) Nahant
# /usr/bin/srvadmin-services.sh start
Starting mptctl:
Waiting for mptctl driver registration to complete: [ OK ]
Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [ OK ]
Starting Systems Management Data Engine:
Starting dsm_sa_datamgr32d: [ OK ]
Starting dsm_sa_eventmgr32d: [ OK ]
Starting DSM SA Shared Services: [ OK ]
Starting DSM SA Connection Service: [ OK ]

This cooked /etc/redhat-release has been verified to work well with both 'yum -y update' and OMSA's 'srvadmin-services.sh start'.

Wednesday, December 06, 2006

trick to enable non-bridged wireless access on a Netgear Wireless router (WGR614v6)

I bought a Netgear wireless router (WGR614v6) from CompUSA yesterday. It is going to a customer's site for WI-FI blackberries (BB7270). I need a wireless router that can simply
  • obtain its own TCP/IP via DHCP from a network port
  • self-contain its WLAN: NAT/SPI protect all wi-fi devices
I would be happier with a Netgear MR814v3, which we have in the lab and which is known to work well with the BB7270. All the bugs and glitches we ran into in the past year turned out to be on RIM's end, rather than the "poor" quality of consumer-grade wireless access points perceived by RIM's engineers.

Following the auto-run wizard from the installation CD to the letter, I connected my own laptop to one of the LAN ports on the unit, with the unit's Internet port plugged into the wall jack in my office. Once I did an 'ipconfig /renew' on the laptop, it got its TCP/IP settings via DHCP as promised. It is a bit odd to see the DNS server listed as 192.168.1.1, which means the Netgear unit proxies DNS queries for devices using its DHCP/NAT services. However, two things did not work right:
  • nslookup of anything resolves to 192.168.1.1?
  • the WI-FI LED on the front of the unit was not lit. 'Router status' said 'wireless :: off', while the checkbox to turn on the 'wireless router radio' was grayed out!
Searching Netgear's site, I was relieved/disappointed to find the firmware was up to date. There's a link from the WGR614's support page on how to enable the unit as a wireless access point. However, that setup bridges the WLAN to the local LAN. I attempted it to satisfy my curiosity. It requires using a regular LAN port as the autosense uplink instead of the separate Internet port on the unit. Once I did it, the laptop and a wifi blackberry were bridged to the real Ethernet LAN and got their TCP/IP settings from the real corporate DHCP server instead of from the unit's own.

Lo & behold, after this diversion everything went peachy all of a sudden. The checkbox to turn on the 'wireless router radio' option became available, even after I moved the cable from the regular LAN port back to the Internet port on the unit. The wireless icon on the unit is blinking green. Power-cycling the unit didn't take the capability away.

I am a bit at a loss why the wireless capability was disabled before and how it got enabled by the steps I took. One thing I should have done at the beginning is reset it to factory defaults, just in case someone had messed it up somehow. The unit had all its seals and should be a new one. Anyhow, I am glad it worked out. However, I can't imagine any non-techie getting this far w/o picking up the phone or returning it to the retailer.

One thing to note is that WPA-PSK worked as advertised, both on the access point and on the BB7270. A lengthy 'mydomainDS-24random' was used for the SSID and the maximum allowed '63RANDOM' as the PSK (pre-shared key). I couldn't imagine typing them in w/o a typo, so I emailed the lengthy secrets to the email account currently assigned to my test BB7270. On the BB7270, I copied+pasted these lengthy random bits from the email into a new WLAN profile creation form under Options/WLAN. It worked great. With older versions of the BB7270 firmware, copy+paste of a WEP key failed every time; I'd assume the bug has been fixed by now (from v4.0.1.84 to v4.0.1.104). On the BES (BlackBerry Enterprise Server), a site-specific "IT Policy" created with these WLAN secrets worked great too. However, the SSID somehow took several kicks to really get sucked into the "IT policy". The BBM (BlackBerry Manager) didn't complain when the SSID didn't take, with the result that a BB7270 provisioned from scratch after nuking didn't carry any WLAN profile at all. Only upon reviewing the IT policy was I shocked to find the SSID field blank while everything else under the WLAN policy section looked good.

Sunday, October 22, 2006

which is eth0 on a PowerEdge 6850 running CentOS 4.4/i386

When setting up the refurbished Dell PowerEdge 6850 I purchased recently, I noticed it is newer and has a few oddities:
  • no video connector on the front panel or the back; SVGA video is only on the DRAC 5 card
  • a dual-port Intel Pro copper Ethernet card
  • the PERC adapter is a 4e/DC instead of the 4e/Di found in the other two PE6850s bought new from Dell last summer
A minimal installation of Linux went smoothly using a CentOS 4.4/i386 server CD. Well, I did run into one hitch that caused problems getting network connectivity going. Initially, I didn't have confidence in the ports on either end of the cat5e cable:
  • The "new" port is on an aged Cisco Catalyst 2900XL switch bought off eBay in 2000. Since I joined the company a year ago, my boss has told me several times that one port on this switch is known to be bad. I have used twelve free ports so far without a problem; this port is one of the last two "free" ports on the 24-port switch.
  • The port was in VLAN #1 to mirror two neighboring ports and had not been in active use. I reprogrammed the switch so that the port now participates in VLAN #3 instead of VLAN #1 and no longer mirrors any port (see the sketch after this list).
  • By switching the cable from port to port on the server, I was hoping to see a steady green LED and a blinking yellow LED on the NIC. The results were not telling: the first embedded Broadcom port had both, but no connectivity; the two Intel Pro ports had only the green LED, with no blinking yellow LED and no connectivity. I later came to the realization that switching quickly from port to port won't work well, probably due to the ARP cache and such on the switch.
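For reference, the switch-side change was along these lines (a 2900XL-style IOS sketch from memory; the interface number is illustrative and the exact SPAN syntax may differ by IOS release):
conf t
interface FastEthernet0/23
 ! stop mirroring the neighboring ports
 no port monitor
 ! move the port from VLAN 1 to VLAN 3
 switchport access vlan 3
end
wr mem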
After a bit of frustration, I decided to leave the cable in the first embedded port and do some serious work on the server side to make it work. Needing a quieter place to think and type, I borrowed a serial cable to access its serial console. All the boxen I set up have the serial console enabled in /etc/inittab and in GRUB or LILO; on Dell PowerEdge servers, BIOS redirect-to-serial is enabled too. I've been too busy to bargain-hunt for a serial console concentrator. I used a Digi board before and it is nice; however, a Digi board is too expensive a proposition for my current employer, who is satisfied with calling the ISP technician to hook up a monitor and keyboard to the servers.
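For reference, that serial-console setup amounts to a couple of one-liners on a RHEL4-era box (a sketch; ttyS0 and 9600 bps are assumptions that must match the BIOS console-redirection settings):
# /etc/inittab: spawn a login getty on the serial port
S0:2345:respawn:/sbin/agetty -L 9600 ttyS0 vt100

# /boot/grub/grub.conf: redirect GRUB and the kernel console as well
serial --unit=0 --speed=9600
terminal --timeout=5 serial console
# then append "console=tty0 console=ttyS0,9600n8" to the kernel line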

Once I sat down in the much quieter guest working area, I quickly put the pieces together and realized that the embedded Broadcom Ethernet ports were enumerated as eth2 and eth3, while the two ports on the Intel Pro PCI-e card came up as eth0 and eth1. Here is my "detective work":

grep eth /etc/modprobe.conf
alias eth0 e1000
alias eth1 e1000
alias eth2 tg3
alias eth3 tg3
dmesg
e1000: 0000:17:06.0: e1000_probe: (PCI:66MHz:64-bit) 00:04:23:cd:6e:b8
divert: allocating divert_blk for eth0
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
ACPI: PCI interrupt 0000:17:06.1[B] -> GSI 133 (level, low) -> IRQ 50
e1000: 0000:17:06.1: e1000_probe: (PCI:66MHz:64-bit) 00:04:23:cd:6e:b9
divert: allocating divert_blk for eth1
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
tg3.c:v3.52-rh (Mar 06, 2006)
ACPI: PCI interrupt 0000:14:02.0[A] -> GSI 64 (level, low) -> IRQ 201
divert: allocating divert_blk for eth2
eth2: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:13:72:40:e1:bd
eth2: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[1] Split[0] WireSpeed[1] TSOcap[0]
eth2: dma_rwctrl[769f4000]
ACPI: PCI interrupt 0000:14:02.1[B] -> GSI 65 (level, low) -> IRQ 209
divert: allocating divert_blk for eth3
eth3: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:13:72:40:e1:be
eth3: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth3: dma_rwctrl[769f4000]
grep HWA /etc/sysconfig/network-scripts/ifcfg-eth?
ifcfg-eth1:HWADDR=00:04:23:CD:6E:B9
ifcfg-eth2:HWADDR=00:13:72:40:E1:BD
ifcfg-eth3:HWADDR=00:13:72:40:E1:BE
ifcfg-eth0:HWADDR=00:04:23:CD:6E:B8

With the card-to-driver mapping in dmesg and in modprobe.conf, it is easy to see that the embedded Broadcom Ethernet ports are mapped to eth2 and eth3, while the add-on Intel Pro ports are mapped to eth0 and eth1. The MAC addresses in dmesg and the HWADDR values recorded in the ifcfg-eth0 through ifcfg-eth3 files hammered in the last nail. Without that last nail, it would be rather hard to tell which port is which if all the ports were of the same make and model.

mii-tool sorta told the tale as well. lspci can help too.
eth0: no link eth1: no link eth2: 10baseTx, HD, link ok eth3: no link
Once I duplicated the IP/netmask from eth0 onto eth2, then ran ifdown eth0 and ifup eth2, all was well per 'mii-tool':
eth2: negotiated 100baseTx-FD, link ok

It'd be an interesting exercise to force eth0 to be the first embedded Broadcom port. It makes more sense for Linux to enumerate the embedded ports first, Ethernet or not. What do you think?
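If I were to try it, a sketch of the approach on CentOS 4 would be to swap the driver aliases and the recorded MAC addresses together (untested here; the MACs below are the ones from the dmesg output above):
# /etc/modprobe.conf: Broadcom ports become eth0/eth1, Intel PRO/1000 ports become eth2/eth3
alias eth0 tg3
alias eth1 tg3
alias eth2 e1000
alias eth3 e1000

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for the other three):
DEVICE=eth0
HWADDR=00:13:72:40:E1:BD   # MAC of the first embedded Broadcom port
ONBOOT=yes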