Monday, November 27, 2006

Garmin StreetPilot C320 :: review by Experience

I bought a new Garmin StreetPilot C320 GPS right before my trip to Florida over the Thanksgiving weekend.
Pros: It worked for the most part.
Cons:
  • The online registration to unlock the map was an extra hassle I didn't expect.
  • Too much other junk in the map means a 128M SD card holds no more than three states at a time. I was told some other brands need less than 4M a state.
  • Reception is poor. It failed to make its first round of contacts with satellites even when I placed it on the picnic table on the back deck or on the front patio stairs. It failed once on the interstate too, when there was a 16-wheeler nearby. Considering the radio still works in those conditions, you'd think they could augment or supplement the signal somehow to make the GPS work too. Otherwise, you simply can't drive in crowded traffic and expect it to guide you.

Here are the other occasions when it failed to work or could use some improvement:
  • In St Augustine Beach, FL, I was directed to turn left onto Mandrid St from SR A1A South to reach my destination. It turned out Mandrid street was blocked by a private gate.
  • In Atlanta, GA, I was instructed to turn left in 4 miles off 285 to my house. Nothing was mentioned about a right turn (T-junction) from Ashford-Dunwoody Rd to Mt. Vernon Rd.
  • In Kissimmee, FL, I was directed to a Super Wal-Mart 3.2 miles away from the resort we stayed in instead of the one 1.2 miles away. Apparently both had been there for quite a few years already.
  • In St. Augustine, FL, we wanted to go to a Thai restaurant. However, the unit complained it couldn't find '4900 SR US 1'. By searching 'shopping' on the unit, however, we located a Publix at '1127 SR US 1'. On more than one occasion it claimed no such address could be found even though the whole state of FL was loaded.
  • Driving back from Orlando, FL to Atlanta, GA, it said to turn in 310 miles when we paid $2.50 at the 2nd toll plaza. I figured $2.50 was cheap for a 300-mile toll road. It turned out it had failed to mention the merge from the Florida Turnpike to I-75 N in 27 miles.

The extra junk turned out to be very handy when we were in Orlando:
  • To find a Super Wal-Mart to buy a special brand of organic milk for my son, I just selected 'shopping' and spotted the first Super Wal-Mart, as the list is sorted by distance from your current location. We left our laptops at home on purpose even though the places we stayed had free Wi-Fi. Without the GPS, I guess we'd have had to drive around to spot one on the run. We found a CVS and a Publix the same way.
  • We had tickets for Downtown Disney. However, neither the tickets nor the brochures list a street address. I guess the smart Disney writers assumed everybody knows where it is?! The unit listed it under 'theme parks' as 'Disney Quest'. That solved our problem pretty easily.
All in all, it eliminated quite a few headaches for my wife and me.

Friday, November 10, 2006

perc 4e/Di on Dell PE6850 saga continues...part C

With the load from the full database dump plus the application load burst we set up last Friday, the problematic server 'syb04' generated a few alerts over the weekend. The alerts complained that the stamp didn't show up right after the 'logger' call. We were very excited, thinking we had been able to reproduce the problem this quickly. The next step would be just to pick what to upgrade from a decent list of potential upgrades.

A close examination of the local log as well as the log on the remote syslogd server, however, showed that such 'missing' stamps showed up right after the complaints of their absence. Most were within the same second, and only one was one second late. Therefore the alerts were identified as false positives.
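
Since the stamps were merely late rather than missing, a check that allows a short grace period would avoid this kind of false alert. Here is a minimal sketch, not the actual monitor script: the function name, the retry count, and the log path in the usage comment are all assumptions made for illustration.

```shell
# Return 0 if the stamp ($1) shows up in the log file ($2) within $3 attempts,
# waiting one second between attempts; return 1 otherwise.
wait_for_stamp() {
    stamp=$1
    logfile=$2
    tries=$3
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -F: treat the stamp as a fixed string, not a regular expression
        if grep -F -- "$stamp" "$logfile" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$tries" ]; then
            sleep 1
        fi
    done
    return 1
}

# Hypothetical usage (log path assumed, adjust to your syslog setup):
#   stamp="hiccup-check-$(date +%s)-$$"
#   logger "$stamp"
#   wait_for_stamp "$stamp" /var/log/messages 3 || echo "stamp missing: $stamp"
```

With three attempts, a stamp that lands one second late, as seen above, would no longer raise an alert.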

Needless to say, we were very disappointed. Even worse, this remained the case for the full week. We started to toss around the idea that maybe the hiccup was merely a delay and we had overreacted a little by flipping the switch.

To add insult to injury, our constant attention was demanded by a lot of database problems related to the application peak load, which was coerced to repeat. The problems were:
  • The Sybase ASE log device filled up, bringing the application peak load to a sudden halt until Sybase was restarted with the log cleared.
  • Hourly transaction dumps have grown from 20M each to over 1G each. It seemed like some transactions failed to commit.
  • In turn, the transaction dumps filled up the disk.
So far, the server has endured seven 24-hour days of load runs, which totals 24 x 7 x 3 = 504 load peaks. A regular day has only 2 peaks, so this equals about 250 days' worth of load (504 / 2 = 252).
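
The peak count above can be double-checked with a few lines of shell arithmetic. This is only a sketch: the function names are made up, and reading the factor 3 as "peaks per hour" is my interpretation of the 24x7x3 product.

```shell
# Load peaks generated by a continuous run.
# $1 = hours per day, $2 = days, $3 = peaks per hour (assumed meaning of the "3")
total_peaks() {
    echo $(( $1 * $2 * $3 ))
}

# Days of regular operation that a number of peaks corresponds to.
# $1 = total peaks, $2 = peaks on a regular day
equivalent_days() {
    echo $(( $1 / $2 ))
}

peaks=$(total_peaks 24 7 3)
echo "$peaks peaks ~ $(equivalent_days "$peaks" 2) regular days"
```

This prints "504 peaks ~ 252 regular days", so "250 days" is a fair round figure.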

I am betting a small sum of money on the PR (patrol read), whose background scheduling may be surprised by the sudden spike in disk IO caused by the nightly full database backup as well as the daily application peak. To force PR to collide with the load, I wrote a script to check PR status and start one if none is 'In Progress' already, as reported by 'megapr -dispPR -a0'.
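
The script itself is not reproduced here. A minimal sketch of such a check might look like the following. It assumes the report contains a line shaped like 'Status :PR In Progress', and it takes the command that starts a patrol read as a parameter rather than guessing at the megapr flag for it. The function names are hypothetical.

```shell
# Succeed (exit 0) if the PR report on stdin shows a patrol read in progress.
pr_in_progress() {
    grep '^Status *:.*In Progress' >/dev/null
}

# Start a patrol read only if none is running.
# $1 = command that prints the PR status, $2 = command that starts a PR.
# Both are supplied by the caller; neither is hard-coded here.
ensure_pr_running() {
    status_cmd=$1
    start_cmd=$2
    # Unquoted on purpose so "cmd arg arg" is split into words.
    if $status_cmd | pr_in_progress; then
        echo "PR already in progress"
    else
        echo "starting PR"
        $start_cmd
    fi
}

# Hypothetical usage from cron, every few minutes:
#   ensure_pr_running "megapr -dispPR -a0" "<command that starts a patrol read>"
```

Keeping the status parsing in its own function makes it easy to test against a saved copy of the report without touching the controller.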

BTW, the 'megapr -dispPR -a0' command alone causes the following errors in PERC controller's exportlog. My inquiry on this error got no response from Dell's linux-PowerEdge forum, which is monitored by a few Dell engineers.
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[6] pRfm a07517c0
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[16] pRfm a0743360
11/07 10:25:51: GET: SCSI_chn=ff, rtn status=0

Tuesday, November 07, 2006

debugfs :: handy utility to debug an ext2 or ext3 file system

Today I am learning a useful utility program named 'debugfs'. It is part of the e2fsprogs package, an essential package containing auxiliary programs for the ext2 and ext3 file systems under Linux.

For a regular file, 'debugfs' can help you find the inode that owns any data block the file or directory entry is using. Then you can turn around and ask for the name of the inode. This can be handy when some mysterious file causes df and du to disagree on whether the file system is full, or when the file system is corrupted or can't be mounted and accessed as usual. More advanced file system features are available too.

# to find what inode is claiming a given data block
# debugfs -R "icheck 12345" /dev/hda1
debugfs 1.35 (28-Feb-2004)
Block Inode number
12345 340

# to find the file name given the inode number
# debugfs -R "ncheck 49153" /dev/hda1
debugfs 1.35 (28-Feb-2004)
Inode Pathname
49153 /usr/share/locale/ar/LC_MESSAGES/libbonobo-2.0.mo
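
The two lookups above can be chained into one helper that goes from a block number straight to a path name. This is a rough sketch that assumes the two-column output shown above; the function names are made up, and it needs root to read the device.

```shell
# Print the inode that owns block $1, given "debugfs icheck" output on stdin.
parse_icheck() {
    awk -v blk="$1" '$1 == blk { print $2 }'
}

# Print the path name for inode $1, given "debugfs ncheck" output on stdin.
parse_ncheck() {
    awk -v ino="$1" '$1 == ino { sub(/^[0-9]+[ \t]+/, ""); print }'
}

# Chain the two: block -> inode -> path name.
# $1 = device (e.g. /dev/hda1), $2 = block number
block_to_path() {
    dev=$1
    blk=$2
    # 2>/dev/null drops whatever debugfs writes to stderr
    ino=$(debugfs -R "icheck $blk" "$dev" 2>/dev/null | parse_icheck "$blk")
    # Give up unless we got a plain inode number back.
    case $ino in
        ''|*[!0-9]*) return 1 ;;
    esac
    debugfs -R "ncheck $ino" "$dev" 2>/dev/null | parse_ncheck "$ino"
}

# Hypothetical usage:
#   block_to_path /dev/hda1 12345
```

The parsing is split out from the debugfs calls so it can be tried on saved output first, without touching a real device.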


# Print the location of the inode data structure
# debugfs -R "imap /boot/vmlinuz-2.6.9-42.0.2.EL" /dev/hda1
debugfs 1.35 (28-Feb-2004)
Inode 557516 is part of block group 34
located at block 1114128, offset 0x0580


# to dump the contents of a file (filespec, per man page) to another file
# debugfs -R "dump -p /boot/vmlinuz-2.6.9-42.0.2.EL /tmp/vmlinuz_dumped" /dev/hda1
# md5sum /boot/vmlinuz-2.6.9-42.0.2.EL /tmp/vmlinuz_dumped
e5c536b539b5ffcaa03b22bd7fcc164a /boot/vmlinuz-2.6.9-42.0.2.EL
e5c536b539b5ffcaa03b22bd7fcc164a /tmp/vmlinuz_dumped

# to get the contents of a file, assuming the fs can't be mounted and accessed the usual way
# debugfs -R "cat /etc/redhat-release" /dev/hda1
debugfs 1.35 (28-Feb-2004)
CentOS release 4.4 (Final)

A noteworthy case: icheck reported that block 4567 belongs to inode 8, yet ncheck could not find a path name for that inode. 'find / -inum 8' does turn up /selinux/relabel, but /selinux is a separate pseudo fs with its own inode numbers, so that is a different inode 8. On ext3, inode 8 on the device is normally the reserved journal inode, which has no directory entry.
# debugfs -R "ncheck 8" /dev/hda1
debugfs 1.35 (28-Feb-2004)
Inode Pathname
8
# find / -inum 8
/selinux/relabel
# ls -id /selinux/relabel
8 /selinux/relabel
# debugfs -R "icheck 4567" /dev/hda1
debugfs 1.35 (28-Feb-2004)
Block Inode number
4567 8

# / is on /dev/hda1
/dev/hda1 8127400 6738524 1306308 84% /

There are a lot of powerful (and dangerous) features, such as:
  • feature: set or clear various file system features in the superblock
  • freeb: mark data blocks as unallocated (setb marks them as allocated)
  • freei: free the inode specified
  • clri: clear the contents of the inode
  • chroot: chroot to the directory
  • find_free_block
  • find_free_inode
  • init_filesys: create an ext2 file system
  • kill_file: deallocate the file's inode and its blocks. It doesn't remove any direntry pointing to this inode, so it is not the same as 'rm' or 'unlink'.
  • logdump: dump the ext3 journal
  • modify_inode: modify the contents of the inode structure
  • ls/mkdir/mknod/rm/rmdir

'debugfs' starts interactively by default, unless you pass '-R' to run a single request. A session looks like this:
# debugfs
debugfs 1.35 (28-Feb-2004)

debugfs: open /dev/hda1
debugfs: icheck 12345
Block Inode number
12345 340

debugfs: ncheck 340
Inode Pathname
340 /usr/X11R6/lib/xscreensaver/mountain

debugfs: close

debugfs: quit

AUUG November meeting :: Legal Concerns on Blogger & GPLv3

7:26pm last night, I finally made my way into the HP building across I-285 from Perimeter Mall. I couldn't find the front entry because construction blocked part of the road, so I pulled in wherever I could next to the only tall building standing. A few employees passed me in the next lane and were kind enough to roll down their windows to tell me that a badge is required for after-hours access. I buzzed the security guy. Once I told him that I was there for the AUUG meeting, he attempted to buzz me in a few times, to no avail. So he dispatched someone to come down and open the door for me. While I was parking my car, he stood there holding the door for me. I was really touched.
The presentation was on legal considerations for bloggers and the legal implications of the new provisions in GPLv3. It was well done by two lawyers from Manning, Morris, and Somebody, a local law firm and one of the corporate sponsors of AUUG. I knew Linus was not a fan of GPLv3 because of some of its restrictive clauses. The presenter made it much clearer: the motivation behind it is some kind of vendetta against commercialism by an academic and pure idealist.

Let me list below the three most problematic clauses in GPLv3, as discussed in the presentation:
  • patent retaliation
  • bundle rights
  • no locked key [ The lawyer's own research ]

Saturday, November 04, 2006

perc 4e/Di on Dell PE6850 saga continues...part B

After searching up & down, I compiled a decent list of potential upgrades and toggles to try out on syb04. None of them seems pertinent enough to make you say 'ahh-ha'. I purchased and put into production a new server named syb06, the one killed by oom-killer and cured by kernel-hugemem. With the improved production configuration mix, we can afford more time than we could when syb04 last locked up. So, our team of 'experts' decided to reproduce the recent hiccup or the older lockup problem reliably before attempting a fix this time around.

Hobbit Monitor is running on all the servers, so it is rather easy to catch the old lockup problem, wherein all checks went 'purple', as in 'stale' or 'no report received' status. It is a bit trickier to detect when a hiccup happens. If it falls squarely inside the 5-minute interval Hobbit Monitor uses, we'd miss the signal! It seems it is not all that easy to bring the monitoring interval down to 1 minute for one single client, as nobody has answered my question on the Hobbit mailing list for three days now. After much discussion of alternatives, I came up with a way and verified that it works.

With the monitor fine-tuned and focused on syb04, load was added to it first. Counting the full nightly database backup and the daily peak as two load situations, we need at least 28 peaks to equate to the 14 days leading up to the lockup. The nightly database backup takes only 25 minutes, so it is very easy to run it continuously by simply changing the cron schedule to every 30 minutes instead of every day. So, we did that. After 20 hours (~= 40 load peaks), nothing happened. Since we don't plan to work over the weekend, we decided to simulate the daily load peak as well and let it run continuously. It took some Java code changes, and it is done. So, we'd have both the application load and the backup load against the server over the weekend.

* fingers-crossed *

Maxtor Personal Storage 3200 320 GB external USB 2.0 drive to stage backups under Linux

Today I bought a Maxtor Personal Storage 3200 320 GB (Model U01H320) from Circuit City. 320G with 8M cache for merely $159. They matched their online price by taking $20 off the $179 in-store price. Amazon.com is selling the same item on behalf of CompUSA for $169.
The drive will be used to stage backups at our co-lo. At work, we've waited too long for a budget to come through for a real NAS or a tape library/magazine/autoloader. It is one of those day-to-day realities of small businesses, which face the same challenges as larger ones with much more limited resources. The daily full backup set now amounts to 12G, all compressed by gzip, my favorite tool.

From a RHEL 4.4 AS (CentOS 4.4, to be more exact) guest inside VMware, the drive was detected as USB 1.1 (full speed) instead of USB 2.0. I assume it is a limitation of the VMware emulation.
usb 1-1: new full speed USB device using address 2
Initializing USB Mass Storage driver...
scsi1 : SCSI emulation for USB Mass Storage devices
Vendor: Maxtor Model: 3200 Rev: 0341
Type: Direct-Access ANSI SCSI revision: 02
USB Mass Storage device found at 2
usbcore: registered new driver usb-storage
USB Mass Storage support registered.
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: assuming drive cache: write through
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: assuming drive cache: write through
sda: sda1
Attached scsi disk sda at scsi1, channel 0, id 0, lun 0

Pretty interesting to see it actually has NTFS as the partition type. I wonder how it would fly with MacOS out of the box?
Disk /dev/sda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 38913 312568641 7 HPFS/NTFS

Since I am gonna use it under RHEL 4, I went ahead, relabeled the partition type as 'Linux', and formatted it as ext3.
/root# mke2fs -L E320G -O sparse_super,dir_index,filetype -T largefile4 -j /dev/sda1
mke2fs 1.35 (28-Feb-2004)
Filesystem label=E320G
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
76320 inodes, 78142160 blocks
3907108 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=79691776
2385 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 33 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

Ahh, I always forget to reduce the percentage reserved for the super user from the default 5% to 1%. Before the change, df reported:
/dev/sda1 312538740 98368 296811940 1% /mnt
/root# tune2fs -m 1 /dev/sda1
tune2fs 1.35 (28-Feb-2004)
Setting reserved blocks percentage to 1 (781421 blocks)

/root# df -kv /mnt
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 312538740 98368 309314688 1% /mnt
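
The effect of lowering the reserved percentage can be checked against the numbers above. This small sketch in shell arithmetic uses the 4096-byte block size and the 78142160-block count that mke2fs reported. The function names are made up for illustration.

```shell
# Blocks reserved for the super user: total blocks * percent / 100 (integer math).
# $1 = total blocks, $2 = reserved percentage
reserved_blocks() {
    echo $(( $1 * $2 / 100 ))
}

# Space freed, in 1K units, when the reserved percentage drops from $2 to $3
# on a file system with $1 blocks of $4 bytes each.
freed_kb() {
    old=$(reserved_blocks "$1" "$2")
    new=$(reserved_blocks "$1" "$3")
    echo $(( (old - new) * $4 / 1024 ))
}

reserved_blocks 78142160 5      # 3907108, as mke2fs reported
reserved_blocks 78142160 1      # 781421, as tune2fs reported
freed_kb 78142160 5 1 4096      # 12502748
```

The last figure matches the growth in df's Available column: 309314688 - 296811940 = 12502748 1K-blocks, roughly 12G gained on a drive that will only hold backups.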

Friday, November 03, 2006

Comcast connection problem (Motorola SF4100) for the past two weeks

I have been having connection problems with Comcast (Motorola SF4100 cable modem) for the past two weeks. The speed was not stable at all, and the connection was sometimes lost altogether. Why can't a broadband ISP provide a service as robust as running water or electricity? Is it because
we as customers are not that demanding:
  • we fail to scream at them for each and every outage or service degradation
  • we fail to find time to talk to our legislators
  • we fail to vote with our feet
  • we fail to penalize them even within the current framework. If only everyone would report the problem, ask for a ticket to be opened, and call Billing back with the ticket number to get a refund! This would help on several fronts:
    • the sheer volume of calls/tickets would open some eyes when the BI or data warehouse report comes in
    • the sheer volume of refund checks would cut into the profit margin
    • the faster and more we complain, the sooner they stop using us as a gauge of whether the service is up or not
  • At the very least, we should demand that a regulation be passed so that refunds for system-wide outages are automated and go to the entire customer base, instead of the current more-hoopla-for-you scheme: a refund is only available to those who went through the trouble of reporting the problem, asking for a ticket, and calling Billing back with the ticket number to get a teeny-weeny refund.

Here are the symptoms of the week:
  • Power-cycling the cable modem and/or wireless router may or may not help.
  • Download speed swings from 200K down to 0.4KBps, or to a stall, several times within an hour.
  • The web interface of the cable modem itself shows the errors/warnings listed below. When I looked at it while the connection worked properly, the entries were all Debug/Informational.
  • When I asked the router to renew its address, it failed over a dozen retries and timed out.
  • Speed tests via http://www.speakeasy.net/speedtest/ show the connection is not stable at all: 600KB/14KB at one point, and 200KB/104KB at another. It really should stay around 700KB/80KB.
061103155029 8-Debug F504.1 Bridge Ethernet Hook. Failed to learn CPE MAC Address.
061103155029 4-Error F507.5 MAC Filters. Add MAC Address can't add entry. Table is full.
061103155019 8-Debug M570.2 Motorola CM certificate present
061103155019 8-Debug M571.7 CM Cert Upgrade Enabled. Initiate after Registration
061103155019 8-Debug I503.0 Cable Modem is OPERATIONAL
061103155019 7-Information B401.0 Authorized
061103155018 8-Debug F502.1 Bridge Forwarding Enabled.
061103155018 8-Debug F502.3 Bridge Learning Enabled.
061103155018 7-Information B0.0 Baseline Privacy
061103155016 7-Information X518.9 Configuration - GGFMMD - Unit Update Enabled by CVC
061103155016 8-Debug I500.1 DOCSIS 1.0 Registration Completed
061103155016 7-Information I500.4 Attempting DOCSIS 1.0 Registration
061103155016 7-Information D509.0 Retrieved TFTP Config File SUCCESS
061103155013 7-Information D507.0 Retrieved Time....... SUCCESS
061103155013 7-Information D511.0 Retrieved DHCP .......... SUCCESS
061103155013 5-Warning D3.0 DHCP WARNING - Non-critical field invalid in response
061103155013 4-Error D530.8 DHCP - Invalid Log Server IP Address.
061103155013 5-Warning D520.2 DHCP Attempt# 6 BkOff: 5s Tot DSC:6 OFF:3 REQ:3 ACK:1
061103155013 3-Critical D1.0 DHCP FAILED - Discover sent, no offer received
061103154959 5-Warning D520.2 DHCP Attempt# 4 BkOff:27s Tot DSC:4 OFF:2 REQ:2 ACK:0
061103154959 3-Critical D1.0 DHCP FAILED - Discover sent, no offer received
061103154932 5-Warning D520.2 DHCP Attempt# 3 BkOff:13s Tot DSC:3 OFF:2 REQ:2 ACK:0
061103154932 3-Critical D2.0 DHCP FAILED - Request sent, No response
061103154927 5-Warning D520.2 DHCP Attempt# 2 BkOff: 4s Tot DSC:2 OFF:1 REQ:1 ACK:0
061103154927 3-Critical D1.0 DHCP FAILED - Discover sent, no offer received
061103154923 5-Warning D520.2 DHCP Attempt# 1 BkOff: 4s Tot DSC:1 OFF:1 REQ:1 ACK:0
061103154923 3-Critical D2.0 DHCP FAILED - Request sent, No response
061103154918 7-Information D0.0 DHCP CM Net Configuration download and Time of Day
061103154918 7-Information T500.0 Acquired Upstream .......... SUCCESS
061103154918 8-Debug T503.1 Acquire US with status OK, powerLevel 19, tempSid 1378
061103154918 8-Debug T505.0 Acquired Upstream with status OK
061103154916 7-Information T501.0 Acquired Downstream (687000000 Hz)........ SUCCESS
061103154916 8-Debug T509.0 Acquired DS with status OK, DS Freq 687000000, US Id 5
061103154906 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154905 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 105000000, US Id 0
061103154905 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154905 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 99000000, US Id 0
061103154905 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154905 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 93000000, US Id 0
061103154905 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154904 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 855000000, US Id 0
061103154904 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154904 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 849000000, US Id 0
061103154904 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154903 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 843000000, US Id 0
061103154903 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154903 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 837000000, US Id 0
061103154903 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154902 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 831000000, US Id 0
061103154902 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154902 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 825000000, US Id 0
061103154902 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154901 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 819000000, US Id 0
061103154901 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154901 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 813000000, US Id 0
061103154901 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154900 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 807000000, US Id 0
061103154900 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154900 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 801000000, US Id 0
061103154900 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154859 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 795000000, US Id 0
061103154859 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154859 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 789000000, US Id 0
061103154859 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154859 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 783000000, US Id 0
061103154859 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154858 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 777000000, US Id 0
061103154858 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154858 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 771000000, US Id 0
061103154858 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154857 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 765000000, US Id 0
061103154857 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154857 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 759000000, US Id 0
061103154857 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154856 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 753000000, US Id 0
061103154856 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154856 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 747000000, US Id 0
061103154856 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154855 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 741000000, US Id 0
061103154855 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154855 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 735000000, US Id 0
061103154855 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154855 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 729000000, US Id 0
061103154855 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154854 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 723000000, US Id 0
061103154854 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154854 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 717000000, US Id 0
061103154854 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154853 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 711000000, US Id 0
061103154853 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154853 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 705000000, US Id 0
061103154853 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154852 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 699000000, US Id 0
061103154852 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154852 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 693000000, US Id 0
061103154852 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154851 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 687000000, US Id 0
061103154851 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154851 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 681000000, US Id 0
061103154851 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154850 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 675000000, US Id 0
061103154850 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154850 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 669000000, US Id 0
061103154850 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154850 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 663000000, US Id 0
061103154850 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154849 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 657000000, US Id 0
061103154849 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154849 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 651000000, US Id 0
061103154849 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154848 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 645000000, US Id 0
061103154848 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154848 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 639000000, US Id 0
061103154848 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154847 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 633000000, US Id 0
061103154847 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154847 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 627000000, US Id 0
061103154847 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154847 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 621000000, US Id 0
061103154847 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154846 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 615000000, US Id 0
061103154846 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154846 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 609000000, US Id 0
061103154846 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154845 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 603000000, US Id 0
061103154845 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154845 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 597000000, US Id 0
061103154845 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154845 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 591000000, US Id 0
061103154845 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154844 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 585000000, US Id 0
061103154844 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154844 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 579000000, US Id 0
061103154844 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154843 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 573000000, US Id 0
061103154843 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154843 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 567000000, US Id 0
061103154843 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154842 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 561000000, US Id 0
061103154842 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154842 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 555000000, US Id 0
061103154842 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154842 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 549000000, US Id 0
061103154842 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154841 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 543000000, US Id 0
061103154841 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154841 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 537000000, US Id 0
061103154841 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154840 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 531000000, US Id 0
061103154840 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154840 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 525000000, US Id 0
061103154840 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154840 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 519000000, US Id 0
061103154840 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154839 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 513000000, US Id 0
061103154839 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154839 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 507000000, US Id 0
061103154839 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154838 8-Debug T509.0 Acquired DS with status NO FEC lock, DS Freq 501000000, US Id 0
061103154838 3-Critical T2.0 SYNC Timing Synchronization failure - Failed to acquire FEC framing
061103154837 7-Information D519.0 DHCP Client shutting down.
061103154837 7-Information H501.2 HFC: Shutting Downstream Down
061103154837 3-Critical I2.0 REG RSP not received
061103154837 1-Emergency I506.0
061103154828 7-Information I500.4 Attempting DOCSIS 1.0 Registration
061103154828 7-Information D509.0 Retrieved TFTP Config File SUCCESS
061103154828 7-Information D507.0 Retrieved Time....... SUCCESS
************ 7-Information D511.0 Retrieved DHCP .......... SUCCESS
************ 5-Warning D3.0 DHCP WARNING - Non-critical field invalid in response
************ 4-Error D530.8 DHCP - Invalid Log Server IP Address.
************ 5-Warning D520.2 DHCP Attempt# 1 BkOff: 5s Tot DSC:1 OFF:1 REQ:1 ACK:1
************ 7-Information D0.0 DHCP CM Net Configuration download and Time of Day
************ 7-Information T500.0 Acquired Upstream .......... SUCCESS
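
With a log this long, it helps to strip out the Debug/Informational noise before reading it. The following is a small sketch for a saved copy of the modem's log page, assuming every line has the shape '<timestamp> <N>-<Level> <code> <message>' as above. The function names and the file name in the usage comment are made up.

```shell
# Print only the log lines whose severity number is at or below $1
# (default 5, i.e. Warning or worse). Reads the log on stdin.
modem_problems() {
    awk -v max="${1:-5}" '
        NF >= 2 {
            split($2, sev, "-")
            if (sev[1] ~ /^[0-9]+$/ && sev[1] + 0 <= max + 0) print
        }'
}

# Count lines per severity level, one "<N>-<Level> <count>" pair per line.
severity_counts() {
    awk 'NF >= 2 { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Hypothetical usage, with the log page saved as modem.log:
#   modem_problems 4 < modem.log     # Error, Critical and worse only
#   severity_counts < modem.log
```

On a healthy day the first command should print nothing, since everything would be Debug or Informational.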

Wednesday, November 01, 2006

perc 4e/Di on Dell PE6850 saga continues...part A

We ended up applying BIOS upgrade (A00->A01) and PERC 4e/Di firmware upgrade (521A to 522A A13) for the system lockup problems we had on the production database server running on a Dell PE6850. Home-made load tests didn't cause panic for 18 hours. The server was then rushed back into production since the fail-over spare server couldn't stand the load.

The server (the Sybase database engines) has been up for 14 days today. At 09:50am, just when the server started to ramp up to its daily load peak (CPU load ~= 4), some processes failed to write to the disk, and a 'date > junk' from the command line just hung there. I canceled that 'date > junk'. All was good again in less than 4 minutes. There was nothing interesting (warn/error/abort) in the system log, the exportlog from the PERC controller, or the database log. PR was running at the time.

The symptoms definitely differ, so the BIOS and firmware upgrades did make a difference for the better. In the previous two lockups, the only two in 15 months, we lost access to the disks totally, getting "reject i/o to offlined disk" without kernel panic or corruption. This time, it was merely a hiccup, pause, or suspension of sorts.

Older postings on a similar topic on the dell-linux-poweredge forum suggested PR could be the culprit if BIOS/firmware is up-to-date. On the system, I get the following output from 'megapr -dispPR -a0' today. Is #Iterations the current count of how many times PR has run, or a threshold of some sort? If the former, how do I clear it? If the latter, how do I increase it? Basically I am looking into why it locked up at exactly 30 days (which could be a coincidence too, and we are now using newer BIOS and firmware). Dell diag from OMSA 4.4 on 10/17/2006 suggests nothing wrong with the controller, memory, or underlying disks. (omreport on the controller is appended below too.)

********PR INFO********
Mode :AUTO
#Iterations:2200
Status :PR In Progress

# omreport storage controller
Controller PERC 4e/Di (Embedded)

Controllers
ID : 0
Status : Ok
Name : PERC 4e/Di
Slot ID : Embedded
State : Ready
Firmware Version : 522A
Driver Version : Not Applicable
Minimum Required Firmware Version : Not Applicable
Minimum Required Driver Version : Not Applicable
Number of Channels : 2
Rebuild Rate : 30%
Alarm State : Not Applicable
Cluster Mode : Not Applicable
SCSI Initiator ID : 7

Also, we upgraded the BIOS from A00 to A01, instead of to the latest A04, since the release notes of A02 through A04 didn't read as pertinent at the time. On a second read of A03's release notes, I noticed the following two fixes that could be relevant to the system. Where can I find more detailed notes than PE6850-BIOSA03.TXT? I don't quite understand why the developers or release managers are so sparing with words.

  • Added support for Virtualization Technology in the processor.
Should I assume this refers not to HT, but to special server virtualization assistance from Intel's VT (?) technology or the like?
  • Added support for 800MHz system configurations.
Does this mean BIOS prior to A03 doesn't support 800MHz system configurations?

The megaraid* driver is dated early 2005, and the CHANGLOG.megraid in /kernel/Documentation doesn't list many interesting changes either.