
Friday, December 22, 2006

How to coerce OMSA 5.1 to install & run properly on CentOS 4 (part III)

Here is a summary of what I did for OMSA 5.1 on CentOS 4.1 (the white-box twin of RHAS 4.1).

For installation on a CentOS 4.1 box without OpenIPMI or net-snmp, you just need to
  1. append "Nahant", RHEL 4's code name, to /etc/redhat-release.
1c1
< CentOS release 4.1 (Final) Nahant
---
> CentOS release 4.1 (Final)
  2. start the installation by running ./linux/supportscripts/srvadmin-install.sh
After installation, answer NO when it asks to start all services. Instead, conduct the following steps first:
1. insert the line 'test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0' at the beginning of /etc/init.d/dsm_sa_ipmi. Without it, /etc/init.d/dsm_sa_ipmi will fail miserably; it seems that udev doesn't put /dev/ipmi0 back as expected.
22,24d21
< # my hack
< test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0
< #
2. insert '/etc/init.d/dsm_sa_ipmi start' at the beginning of the 'start' section inside /usr/bin/srvadmin-services.sh.
Without it, 'srvadmin-services.sh start' will fail because the IPMI drivers fail to load. However, if you start IPMI manually with '/etc/init.d/dsm_sa_ipmi start' first, you'll be just fine.
241,243d240
< # my hack::ipmi failed to start when called from later this script
< /etc/init.d/dsm_sa_ipmi start
<
3. now start it all with /usr/bin/srvadmin-services.sh start
4. verify it with 'omreport chassis temps' and 'omreport storage controller'. These two rely on different things to work; I'll elaborate on this later. (A scripted version of these tweaks is sketched below.)
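
If I had to do this on more than a handful of boxes, I'd script the tweaks roughly as below. This is only a sketch of the steps above, not something shipped with OMSA; the line numbers 22 and 241 come from the diffs above, i.e. my copies of the scripts, so check them against your own files first.

#!/bin/sh
# my rough prep for OMSA 5.1 on these CentOS 4.1 boxes (run once per box)

# make srvadmin-install.sh recognize the OS: append RHEL 4's code name
grep -q Nahant /etc/redhat-release || sed -i 's/$/ Nahant/' /etc/redhat-release

# re-create /dev/ipmi0 before dsm_sa_ipmi needs it
grep -q 'mknod -m 0600 /dev/ipmi0' /etc/init.d/dsm_sa_ipmi || \
    sed -i '22i test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0' /etc/init.d/dsm_sa_ipmi

# start the IPMI init script early in the 'start' section of srvadmin-services.sh
grep -q '^/etc/init.d/dsm_sa_ipmi start' /usr/bin/srvadmin-services.sh || \
    sed -i '241i /etc/init.d/dsm_sa_ipmi start' /usr/bin/srvadmin-services.sh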

The rc scripts all get placed at S50 instead of the start sequence numbers mandated by the chkconfig directives in the scripts. I corrected them manually based on my understanding of those chkconfig directives. However, I didn't get to change run levels to test whether that guarantees OMSA starts up successfully, since all my boxes are in production. Besides, Hobbit Monitor will let me know soon enough upon such a failure.
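
For the record, here is roughly how I pushed the scripts back to the sequence numbers their headers ask for. The service names are the ones I recall this OMSA 5.1 install dropping into /etc/init.d, so verify the list on your box first.

# chkconfig --add re-reads the "# chkconfig:" header and assigns the S/K numbers from it
# (verify the names with: ls /etc/init.d | egrep -i 'dsm|instsvcdrv|dataeng')
for svc in instsvcdrv dsm_sa_ipmi dataeng dsm_om_shrsvc dsm_om_connsvc; do
    chkconfig --del $svc 2>/dev/null
    chkconfig --add $svc
done
ls /etc/rc3.d | egrep -i 'dsm|instsvcdrv|dataeng'   # the links should no longer all be S50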

It is such a great relief to have constant & comprehensive monitoring of critical systems. As a system engineer, you don't have to make all these mental or paper notes to check this or that. A decent NMS will do its job and nag you (whether you like it or not) when necessary.

Wednesday, December 13, 2006

How to coerce OMSA 5.1 to install & run properly on CentOS 4 (part II)

Not so fast! The same fix doesn't work on CentOS 4.1 running on a Dell PE6850. srvadmin-install.sh somehow determined it needed a newer version of OpenIPMI from its own package, regardless of whether the OS already had the latest OpenIPMI installed. Recall that the same hack worked without OpenIPMI installed at all on CentOS 4.4.

# Here is the excerpt from srvadmin-install.sh
Server Administrator requires a newer version of the OpenIPMI driver modules
for the running kernel than are currently installed on the system. The OpenIPMI
driver modules for the running kernel will be upgraded by installing an
OpenIPMI driver RPM.
warning: RPMS/supportRPMS/openipmi-33.13.RHEL4-1dkms.noarch.rpm: V3 DSA signature: NOKEY, key ID 23b66a9d
Preparing... ########################################### [100%]
1:openipmi ########################################### [100%]

Loading kernel module source and prebuilt module binaries (if any)
Installing prebuilt kernel module binaries (if any)

Module build for the currently running kernel was skipped since the
kernel source for this kernel does not seem to be installed.

Your DKMS tree now includes:
openipmi, 33.13.RHEL4, 2.6.9-5.EL, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.EL, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, i686: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELsmp, x86_64: built
openipmi, 33.13.RHEL4, 2.6.9-5.ELhugemem, i686: built

The OpenIPMI packages are from the CentOS 4.4 release DVD, so they couldn't be any newer!
# rpm -aq|grep -i OpenIPMI
OpenIPMI-libs-1.4.14-1.4E.13
OpenIPMI-1.4.14-1.4E.13

None of our production servers has the kernel source or development kits installed, so the DKMS approach isn't one we'd readily take if there are other viable alternatives. So, I modified srvadmin-openipmi.sh to force it not to upgrade OpenIPMI.
# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
1055c1055,1056
< LOC_RECOMMENDED_ACTION=$?
---
> #LOC_RECOMMENDED_ACTION=$?
> LOC_RECOMMENDED_ACTION=4

This worked for the installation. However, upon starting up the services, the IPMI service failed to start. I had run into a problem before with OMSA's own IPMI init script (dsm_sa_ipmi) deleting /dev/ipmi0 while udev never put it back. So, as a simple hack, I just added a line at the beginning of the script to re-create the device node.
test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c 253 0
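
In case the dynamic major ever differs from 253, a slightly more defensive variant of the same hack (my own sketch; it assumes the driver registers itself as 'ipmidev' in /proc/devices) would be:

# look up the ipmidev character major, falling back to 253 if the driver isn't loaded yet
major=$(awk '$2 == "ipmidev" {print $1}' /proc/devices)
test -e /dev/ipmi0 || mknod -m 0600 /dev/ipmi0 c ${major:-253} 0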

After this, /etc/init.d/dsm_sa_ipmi would start OK without the OpenIPMI package installed. However, when it is called from srvadmin-services.sh (I suspect instsvcdrv is the actual caller), it fails every time, leaving omreport unable to get any data. For now, I can manually start dsm_sa_ipmi and then start/stop srvadmin-services.sh just fine.

What makes it worse (and may even be the cause of this IPMI start-up problem) is that the init sequence numbers assigned to all these init scripts are S50, instead of S00 (mptctl), S96 (instsvcdrv), and S97 for all the rest. I don't think this will work upon reboot or run level changes. I fixed it manually per the chkconfig directives inside the scripts. However, I am curious why chkconfig got confused.

P.S. A more complete tweak would be at the beginning of the function in srvadmin-openipmi.sh. The previous tweak on srvadmin-openipmi.sh works only when you already have OpenIPMI installed.

# diff srvadmin-openipmi.sh.orig srvadmin-openipmi.sh
973a974,976
> LOC_RECOMMENDED_ACTION=4
> return ${LOC_RECOMMENDED_ACTION}

How to coerce OMSA 5.1 to install & run properly on CentOS 4

To monitor Dell PowerEdge servers running Linux, I downloaded the OMSA (OpenManage Server Administrator) 5.1 package for Red Hat Enterprise Linux 4 (RHEL4) for a bunch of Dell PowerEdge 6850 servers. The package was listed under 'systems management' in the 'drivers & downloads' section. The actual file name is 'OM_5.1_ManNode_LIN_A00.tar.gz'.

On the PE6850 server, I unpacked the tarball and attempted to install using srvadmin-install.sh. It failed silently, with an exit code of 2.
# supportscripts/srvadmin-install.sh
# echo $?
2
Examining srvadmin-install.sh closely, I realized that it doesn't support CentOS 4. Instead, it supports & detects RHEL4 by its code name 'Nahant'. Instead of mucking with the code to support CentOS, I decided to fake RHEL4 by modifying /etc/redhat-release.
# cat /etc/redhat-release
fake RHEL 4 Nahant
It worked. The installation went a-ok and all services started properly. 'omreport storage controller' now reports all sorts of information on the storage controllers.

Thinking the faking was needed only by the installation scripts, I reverted /etc/redhat-release to its original version. To be sure, I restarted OMSA with 'srvadmin-services.sh stop' and 'start'. Well, it didn't pan out that well.

# /usr/bin/srvadmin-services.sh start

Starting mptctl:
Waiting for mptctl driver registration to complete: [ OK ]
Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [FAILED]
Starting Systems Management Device Drivers:
Starting dell_rbu: Already started [ OK ]
Starting ipmi driver: [FAILED]
Starting DSM SA Shared Services: [ OK ]

Starting DSM SA Connection Service: [ OK ]

Now it seemed the srvadmin-ipmi package's init script still relies on /etc/redhat-release to tell which OS it is running on. The server doesn't have OpenIPMI installed, and without any IPMI driver loaded, no report can be run against storage or temps using OMSA.

To have it work for both worlds (OMSA and CentOS), I cooked up this /etc/redhat-release by appending the RHEL4 code name to the original contents:
# cat /etc/redhat-release.fake
CentOS release 4.4 (Final) Nahant
# /usr/bin/srvadmin-services.sh start
Starting mptctl:
Waiting for mptctl driver registration to complete: [ OK ]
Starting Systems Management Device Drivers:
Starting dell_rbu: [ OK ]
Starting ipmi driver: [ OK ]
Starting Systems Management Data Engine:
Starting dsm_sa_datamgr32d: [ OK ]
Starting dsm_sa_eventmgr32d: [ OK ]
Starting DSM SA Shared Services: [ OK ]
Starting DSM SA Connection Service: [ OK ]

This cooked /etc/redhat-release has been verified to work well with both 'yum -y update' and OMSA's 'srvadmin-services.sh start'.
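
For anyone repeating this, the cooking itself is a one-liner. Here is how I'd do it so it stays reversible (the .orig backup name is just my habit):

# cp -p /etc/redhat-release /etc/redhat-release.orig
# sed -i 's/$/ Nahant/' /etc/redhat-release      # append RHEL 4's code name to the CentOS string
# cat /etc/redhat-release
CentOS release 4.4 (Final) Nahant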

Friday, November 10, 2006

perc 4e/Di on Dell PE6850 saga continues...part C

With the load from the full database dump plus the application load burst we set up last Friday, the problematic server 'syb04' generated a few alerts over the weekend. The alerts complained that the stamp didn't show up right after the 'logger' call. We were very excited, thinking we were able to reproduce the problem this quickly. The next thing would be simply to pick what to upgrade from a decent list of potential upgrades.

Closely examining the local log as well as the log on the remote syslogd server, however, showed that such 'missing' stamps did show up right after the complaints of their absence. Most were within the same second, and only one was one second late. Therefore the alerts were written off as false positives.
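
For reference, the stamp check behind those alerts boils down to something like this (a loose reconstruction, not the exact hook we feed into Hobbit; the tag, sleep interval, and log path are placeholders):

#!/bin/sh
# write a unique stamp via logger, then make sure syslog actually recorded it
stamp="hiccup-probe $(hostname) $(date '+%Y%m%d%H%M%S') $$"
logger -t hiccup "$stamp"
sleep 5
grep -q "$stamp" /var/log/messages || echo "stamp missing -- possible I/O hiccup on $(hostname)"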

Needless to say, we were very disappointed. Even worse, this remained the case for the full week. We started to toss around the idea that maybe the hiccup was merely a delay and we overreacted a little by flipping the switch.

To "add insult to the injury", our constant attention was demanded by a lot of database problems related to application peak load which was coerced to repeat. The problems were:
  • The Sybase ASE log device filled up, causing the application peak load to come to a sudden halt until Sybase was restarted with the log cleared.
  • Hourly transaction dumps grew from 20M each to over 1G each. It seemed like some transactions failed to be committed.
  • In turn, the transaction dumps filled up the disk.
So far, the server has endured seven 24-hour days of load, which totals 24 x 7 x 3 = 504 load peaks. A regular day has only 2 peaks, so this equals roughly 250 days' worth of load.

I am betting a small sum of money on PR (patrol read), whose background scheduling may be surprised by the sudden spike in disk IO caused by the nightly full database backup as well as the daily application peak. To force PR to collide with the load, I wrote a script to check PR status and start one if none is 'In Progress' already, as reported by 'megapr -dispPR -a0' (sketched below).
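
The script boils down to the few lines below. This is a reconstruction from memory; in particular, '-startPR' is what I recall the start flag being, so double-check it against your megapr build's usage output before trusting it.

#!/bin/sh
# force a patrol read on adapter 0 if none is already running
if megapr -dispPR -a0 | grep -q 'In Progress'; then
    logger -t forcePR "patrol read already in progress on a0"
else
    logger -t forcePR "no patrol read in progress on a0, starting one"
    megapr -startPR -a0
fi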

BTW, the 'megapr -dispPR -a0' command alone causes the following errors in the PERC controller's exportlog. My inquiry about this error got no response on Dell's linux-poweredge list, which is monitored by a few Dell engineers.
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[6] pRfm a07517c0
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[16] pRfm a0743360
11/07 10:25:51: GET: SCSI_chn=ff, rtn status=0

Tuesday, November 07, 2006

debugfs :: handy utility to debug an ext2 or ext3 file system

Today I am learning a useful utility named 'debugfs'. It is part of e2fsprogs, an essential package containing auxiliary programs for the ext2 and ext3 file systems under Linux.

For a regular file, 'debugfs' can help you find the inode that claims any given data block used by the file or directory entry. Then you can turn around and ask for the path name of that inode. This can be handy when mysterious files cause df and du to disagree on whether the file system is full, or when the file system is corrupted or can't be mounted and accessed the usual way. More advanced file system features are available too.

# to find what inode is claiming a given data block
# debugfs -R "icheck 12345" /dev/hda1
debugfs 1.35 (28-Feb-2004)
Block   Inode number
12345   340

# to find the file name given the inode number
# debugfs -R "ncheck 49153" /dev/hda1 debugfs 1.35 (28-Feb-2004) Inode Pathname
49153 /usr/share/locale/ar/LC_MESSAGES/libbonobo-2.0.mo


# Print the location of the inode data structure
# debugfs -R "imap /boot/vmlinuz-2.6.9-42.0.2.EL" /dev/hda1 debugfs 1.35 (28-Feb-2004)
Inode 557516 is part of block group 34
located at block 1114128, offset 0x0580


# to dump a file out to another location by its filespec (per the man page)
# debugfs -R "dump -p /boot/vmlinuz-2.6.9-42.0.2.EL /tmp/vmlinuz_dumped" /dev/hda1
# md5sum /boot/vmlinuz-2.6.9-42.0.2.EL /tmp/vmlinuz_dumped
e5c536b539b5ffcaa03b22bd7fcc164a  /boot/vmlinuz-2.6.9-42.0.2.EL
e5c536b539b5ffcaa03b22bd7fcc164a  /tmp/vmlinuz_dumped

# to get the contents of a file, assuming the fs can't be mounted and accessed the usual way
# debugfs -R "cat /etc/redhat-release" /dev/hda1
debugfs 1.35 (28-Feb-2004)
CentOS release 4.4 (Final)

Noteworthy: for files under /selinux (a pseudo fs), debugfs can find the inode number associated with a data block, but it couldn't find a file name for that very inode number. (That is probably because /selinux is a separate pseudo file system, so its inode numbers have nothing to do with /dev/hda1's; inode 8 on an ext3 file system is a reserved inode, the journal, which has no directory entry.)
# debugfs -R "ncheck 8" /dev/hda1 debugfs 1.35 (28-Feb-2004)
Inode Pathname
8
# find / -inum 8
/selinux/relabel
# ls -id /selinux/relabel

8 /selinux/relabel
# debugfs -R "icheck 4567" /dev/hda1
debugfs 1.35 (28-Feb-2004) Block Inode number
4567 8

# / is on /dev/hda1
/dev/hda1 8127400 6738524 1306308 84% /

There are a lot of powerful (and dangerous) features, such as
  • feature to set or clear various file system features in the superblock
  • freeb to mark data blocks as unallocated (vs. setb to mark them allocated)
  • freei to free the specified inode
  • clri to clear the contents of the inode
  • chroot to chroot into a directory
  • find_free_block
  • find_free_inode
  • init_filesys to create an ext2 file system
  • kill_file to deallocate the file and its blocks; it doesn't remove any directory entry pointing to the inode, so it is not 'rm' or 'unlink'
  • logdump to dump the ext3 journal
  • modify_inode to modify the contents of the inode structure
  • ls/mkdir/mknod/rm/rmdir
'debugfs' starts interactively by default, unless you pass '-R' to run a single request. A session looks like this:
# debugfs
debugfs 1.35 (28-Feb-2004)

debugfs: open /dev/hda1
debugfs: icheck 12345
Block Inode number
12345 340

debugfs: ncheck 340
Inode Pathname
340 /usr/X11R6/lib/xscreensaver/mountain

debugfs: close

debugfs: quit

Saturday, November 04, 2006

perc 4e/Di on Dell PE6850 saga continues...part B

After searching up & down, I compiled a decent list of potential upgrades and toggles to try out on syb04. None of them is obviously pertinent enough to make you say 'ah-ha'. I had purchased and put into production a new server named syb06, the one killed by the oom-killer and cured by kernel-hugemem. With the improved production configuration mix, time is more affordable than it was the last time syb04 locked up. So, our team of 'experts' decided to reliably reproduce the recent hiccup, or the older lockup problem, before we attempt a fix this time around.

Hobbit Monitor is running on all the servers, so it is rather easy to catch the old lockup problem, wherein all checks go 'purple', as in 'stale' or 'no report received'. It is a bit trickier to detect when a hiccup happens: if it falls squarely inside the 5-minute interval Hobbit Monitor uses, we'd miss the signal! It also seems it is not all that easy to lower the monitoring frequency to 1 minute for a single client; nobody has answered my question on the Hobbit mailing list for three days now. After much discussion of alternatives, I came up with a way and verified that it works.

With the monitor fine-tuned and focused on syb04, load was added to it first. Counting the full nightly database backup and the daily peak as two load situations, we need at least 28 peaks to equate to the 14 days leading up to the lockup. The nightly database backup takes only 25 minutes, and it is very easy to run it continuously by simply changing its cron schedule from once a day to every 30 minutes (see the crontab tweak below). So, we did that. After 20 hours (~40 load peaks), nothing happened. Since we don't plan to work over the weekend, we decided to simulate the daily load peak and let it run continuously as well. It took some Java code changes, and it is done. So, we'll have both the application load and the backup load against the server over the weekend.
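
The cron change itself is trivial; roughly as below (the script name is a stand-in for our actual backup job):

# before: the nightly full dump, once a day
#30 1 * * * /usr/local/bin/full_db_dump.sh
# during the load test: every 30 minutes, since one run takes ~25 minutes
0,30 * * * * /usr/local/bin/full_db_dump.sh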

* fingers-crossed *

Maxtor Personal Storage 3200 320 GB external USB 2.0 drive to stage backups under Linux

Today I bought a Maxtor Personal Storage 3200 320 GB (Model U01H320) from Circuit City. 320G with 8M cache for merely $159. They matched their online price by taking $20 off the $179 in-store price. Amazon.com is selling the same item on behalf of CompUSA for $169.
The drive will be used to stage backups at our co-lo. At work, we've waited too long for a budget to come through for a real NAS or a tape library/magazine/autoloader. It is one of those day-to-day realities of a small business, which faces the same challenges with much more limited resources. The daily full backup set now amounts to 12G, all compressed by gzip, my favorite tool.

From a RHEL 4.4 AS (CentOS 4.4, to be more exact) guest inside VMware, the drive was detected as USB 1.1 (full speed) instead of USB 2.0. I assume it is a limitation of the VMware USB emulation.
usb 1-1: new full speed USB device using address 2
Initializing USB Mass Storage driver...

scsi1 : SCSI emulation for USB Mass Storage devices

Vendor: Maxtor Model: 3200 Rev: 0341

Type: Direct-Access ANSI SCSI revision: 02
USB Mass Storage device found at 2
usbcore: registered new driver usb-storage
USB Mass Storage support registered.
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: assuming drive cache: write through
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: assuming drive cache: write through
sda: sda1
Attached scsi disk sda at scsi1, channel 0, id 0, lun 0

Pretty interesting to see it actually comes with NTFS as the partition type. I wonder how it would fly with MacOS out of the box?
Disk /dev/sda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1       38913   312568641    7  HPFS/NTFS
Since I am gonna use it under RHEL 4, I went ahead, relabeled the partition as 'Linux', and formatted it as ext3.
/root# mke2fs -L E320G -O sparse_super,dir_index,filetype -T largefile4 -j /dev/sda1
mke2fs 1.35 (28-Feb-2004)
Filesystem label=E320G
OS type: Linux

Block size=4096 (log=2)
Fragment size=4096 (log=2)
76320 inodes, 78142160 blocks
3907108 blocks (5.00%) reserved for the super user

First data block=0

Maximum filesystem blocks=79691776
2385 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

Ahh, I always forget to reduce the percentage reserved for the super user from the default 5% to 1%.
/dev/sda1 312538740 98368 296811940 1% /mnt
/root# tune2fs -m 1 /dev/sda1
tune2fs 1.35 (28-Feb-2004)
Setting reserved blocks percentage to 1 (781421 blocks)

/root# df -kv /mnt

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 312538740 98368 309314688 1% /mnt
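
Since the file system now carries the label E320G, an fstab entry like the one below (my own addition; 'noauto' because the drive comes and goes over USB) makes mounting it a simple 'mount /mnt', matching the df output above:

# /etc/fstab
LABEL=E320G    /mnt    ext3    noauto,defaults    0 0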

Wednesday, November 01, 2006

perc 4e/Di on Dell PE6850 saga continues...part A

We ended up applying a BIOS upgrade (A00 -> A01) and a PERC 4e/Di firmware upgrade (521A -> 522A, A13) for the system lockup problems we had on the production database server running on a Dell PE6850. Home-made load tests didn't cause a panic for 18 hours. The server was then rushed back into production, since the fail-over spare server couldn't stand the load.

The server (the Sybase database engines) has been up for 14 days today. At 09:50am, just when the server started to ramp up to its daily load peak (CPU load ~4), some processes failed to write to the disk and a 'date > junk' from the command line just hung there. I canceled that 'date > junk'. All was good again after less than 4 minutes. Nothing interesting (warn/error/abort) in the system log, the exportlog from the PERC controller, or the database log. PR was running at the time.

The symptoms definitely differ, so the BIOS and firmware upgrades did make some difference for the better. In the previous two lockups, the only two in 15 months, we lost access to the disks totally, getting "reject i/o to offlined disk" without kernel panic or corruption. This time it was merely a hiccup, a pause or suspension of sorts.

Older postings on similar topics on the dell-linux-poweredge forum suggested PR could be the culprit if the BIOS/firmware is up to date. On the system, I got the following output from 'megapr -dispPR -a0' today. Is #Iterations a running count of the PRs that have run, or a threshold of some sort? If the former, how do I clear it? If the latter, how do I increase it? Basically I am looking into why it locked up at exactly 30 days (which could be coincidence too, and we are now on newer BIOS and firmware). The Dell diag from OMSA 4.4 on 10/17/2006 suggests nothing is wrong with the controller, memory, or underlying disks. (The omreport output for the controller is appended below too.)

********PR INFO********
Mode :AUTO
#Iterations:2200
Status :PR In Progress

# omreport storage controller
Controller PERC 4e/Di (Embedded)

Controllers
ID : 0
Status : Ok
Name : PERC 4e/Di
Slot ID : Embedded
State : Ready
Firmware Version : 522A
Driver Version : Not Applicable
Minimum Required Firmware Version : Not Applicable
Minimum Required Driver Version : Not Applicable
Number of Channels : 2
Rebuild Rate : 30%
Alarm State : Not Applicable
Cluster Mode : Not Applicable
SCSI Initiator ID : 7

Also, we upgraded the BIOS from A00 to A01, instead of to the latest A04, since the release notes of A02 through A04 didn't read as pertinent at the time. On a second read of A03's release notes, I noticed the following two fixes that could be relevant to this system. Where can I find more detailed notes other than PE6850-BIOSA03.TXT? I don't quite understand why the developers or release managers minced words so much.

  • Added support for Virtualization Technology in the processor.
Should I assume this is not referring to HT, but to special server virtualization assistance from Intel's VT (?) technology or the like?
  • Added support for 800MHz system configurations.
Does this mean BIOS revisions prior to A03 don't support 800MHz system configurations?

Although the megaraid* driver is dated early 2005, ChangeLog.megaraid in the kernel's Documentation tree doesn't show many interesting changes either.
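
For the record, this is how I'd check the driver's vintage on the running kernel (assuming the PERC 4e/Di is driven by megaraid_mbox on this 2.6.9 kernel, which is my recollection):

# modinfo megaraid_mbox | egrep -i 'description|version|author'
# dmesg | grep -i megaraid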

Tuesday, October 31, 2006

kernel panic when using a 2G ramdisk on a CentOS 4.4 serverCD installation

I was burnt recently by a minimal installation using the CentOS 4.4 ServerCD. The CentOS project has this one-CD flavor for server use, in addition to the regular 4-CD full distro mirroring Red Hat Enterprise Advanced Server (RHAS). I found it awfully convenient to download only one 530M CD instead of the full 2.4G set. The installations I do at work are all server installations, if my own Linux desktop is not counted. It has been a godsend ever since I first noticed its existence in the CentOS 4.2 iso folder.

The kernel panicked on a new production Sybase ASE database server when the Sybase ASE 12.5 database engines started to zero out ~2G tmpdb devices (data & log devices). The hardware specs: a Dell PowerEdge 6850 with 16G of DDR2 RAM and quad CPUs. The box survived my own stress tests on CPU and disk. However, in hindsight, no attempt was made to exhaust the memory.

Sybase seemed to have finished its task when the DBA called me saying he had lost his connection to the machine. I attached to its serial console and opened a connection using Kermit. Input was not rejected, but there was no response (no echo) from the system. After a quick reboot, I found nothing interesting in the log tail around the panic (syslog.conf is set to write kern.* to /var/log/messages). So, I decided to log the console session and asked the DBA to start Sybase again. Sure enough, it happened again, with tons of 'out of memory' messages, followed by desperate yet fatal attempts by the oom-killer to kill processes to free up memory.
oom-killer: gfp_mask=0xd0
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
cpu 4 hot: low 2, high 6, batch 1
cpu 4 cold: low 0, high 2, batch 1
cpu 5 hot: low 2, high 6, batch 1
cpu 5 cold: low 0, high 2, batch 1
cpu 6 hot: low 2, high 6, batch 1
cpu 6 cold: low 0, high 2, batch 1
cpu 7 hot: low 2, high 6, batch 1
cpu 7 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
cpu 4 hot: low 32, high 96, batch 16
cpu 4 cold: low 0, high 32, batch 16
cpu 5 hot: low 32, high 96, batch 16
cpu 5 cold: low 0, high 32, batch 16
cpu 6 hot: low 32, high 96, batch 16
cpu 6 cold: low 0, high 32, batch 16
cpu 7 hot: low 32, high 96, batch 16
cpu 7 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
cpu 4 hot: low 32, high 96, batch 16
cpu 4 cold: low 0, high 32, batch 16
cpu 5 hot: low 32, high 96, batch 16
cpu 5 cold: low 0, high 32, batch 16
cpu 6 hot: low 32, high 96, batch 16
cpu 6 cold: low 0, high 32, batch 16
cpu 7 hot: low 32, high 96, batch 16
cpu 7 cold: low 0, high 32, batch 16
Free pages: 12506892kB (12493888kB HighMem)
Active:208204 inactive:791964 dirty:101636 writeback:26 unstable:0 free:3126723 slab:20412 mapped:563179 pagetables:
DMA free:12564kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:4 all_unreclaimable? yes
protections[]: 0 0 0
Normal free:440kB min:928kB low:1856kB high:2784kB active:644696kB inactive:16036kB present:901120kB pages_scanned:1069066 all_unreclaimable? yes
protections[]: 0 0 0
HighMem free:12493888kB min:512kB low:1024kB high:1536kB active:180824kB inactive:3159244kB present:16908288kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 3*4kB 5*8kB 4*16kB 3*32kB 3*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12564kB
Normal: 0*4kB 1*8kB 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 440kB
HighMem: 1418*4kB 535*8kB 128*16kB 85*32kB 119*64kB 38*128kB 10*256kB 4*512kB 2*1024kB 2*2048kB 3041*4096kB = 12493888kB
Swap cache: add 0, delete 0, find 0/0, race 0+0
0 bounce buffer pages
Free swap: 16779884kB
4456448 pages of RAM
3963834 pages of HIGHMEM
299626 reserved pages
849369 pages shared
0 pages swap cached
Out of Memory: Killed process 6586 (dataserver).
oom-killer: gfp_mask=0xd0
Mem-info:
...
Out of Memory: Killed process 6587 (dataserver).
oom-killer: gfp_mask=0xd0
Mem-info:
...

The server rebooted fine, with Sybase up, the day before. Today, the DBA got some message claiming the tmpdb devices were not initialized. Since these tmpdb devices are created from scratch when Sybase starts, that really didn't make sense! The DBA decided to drop the current set of tmpdb devices and create a new set instead. Just as the new set was finishing initialization per the Sybase logs, the system stopped responding. Before I lost my console, I was able to issue 'ps -e -o rss,pid,cmd,args|sort -n', 'free', and 'top'. Only about 6G was used out of the total 16G, so the memory was not really exhausted; this had to do with how the kernel was carving up and using the memory. Googling 'oom-killer: gfp_mask=0xd0' yielded a single post which referred to a bug of sorts involving an overly zealous OOM killer and how to disable it, due to a problem with the memory addressing scheme switching between the high-mem and low-mem areas.

Given that, I wondered whether the ramdisk was claiming HIGHMEM; if HIGHMEM were not addressable, or were addressed improperly, that could have the appearance of OOM (out of memory). Once I started to examine /var/log/messages line by line, there it was, painfully obvious: the hugemem kernel was not selected by the CentOS 4.4 ServerCD install. The regular kernel-smp was selected instead.
Oct 23 12:35:30 syb06 kernel: Linux version 2.6.9-42.ELsmp (buildcentos@build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3 )) #1 SMP Sat Aug 12 09:39:11 CDT 2006
Oct 23 12:35:30 syb06 kernel: BIOS-provided physical RAM map:
Oct 23 12:35:30 syb06 kernel:********************************************************
Oct 23 12:35:30 syb06 kernel: * This system has more than 16 Gigabyte of memory. *
Oct 23 12:35:30 syb06 kernel: * It is recommended that you read the release notes *
Oct 23 12:35:30 syb06 kernel: * that accompany your copy of Red Hat Enterprise Linux *
Oct 23 12:35:30 syb06 kernel: * about the recommended kernel for such configurations *
Oct 23 12:35:30 syb06 kernel: **********************************************

Once I installed the corresponding kernel-hugemem and rebooted the server with it (grub.conf needed to be edited manually to boot the new kernel by default), all was peachy again.
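
For anyone hitting the same thing, the fix amounted to the following (a sketch; 2.6.9-42.ELhugemem is my guess at the matching hugemem kernel string on this box, so substitute whatever rpm actually installs):

# yum -y install kernel-hugemem
# grep -n title /boot/grub/grub.conf     # find the index of the new hugemem entry, counting from 0
# vi /boot/grub/grub.conf                # point "default=" at that index
# shutdown -r now
# uname -r                               # should now report something like 2.6.9-42.ELhugemem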

P.S. To answer Barry's question below, here is the 12-day memory RRD graph. The panic happened on the last Thursday or Friday, where the lines are broken due to missing data. It would be nicer if Blogger actually allowed image insertion inside comments. -20061030
P.P.S. Obviously the selection of the right kernel package, and the maintenance thereof, has been a problem/pain for Red Hat's maintainers and customers alike. Per its release notes, the newly minted Fedora Core 6 (FC6) has one single kernel package per architecture. Kernel parameters optimized for different hardware specs (SMP or UP, 4G/8G/16G/32G of RAM, etc.) would be set dynamically upon boot, instead of being hard-coded into different kernel packages: kernel, kernel-smp, and kernel-hugemem. Of course, with RHEL5 beta on the horizon, I guess we won't see such a change (in RHEL6) for a few years :(