Wednesday, November 01, 2006

PERC 4e/Di on Dell PE6850 saga continues... part A

We ended up applying a BIOS upgrade (A00 -> A01) and a PERC 4e/Di firmware upgrade (521A to 522A A13) to address the system lockups we had on the production database server, a Dell PE6850. Our home-made load tests ran for 18 hours without causing a panic. The server was then rushed back into production, since the fail-over spare server couldn't handle the load.

The server (the Sybase database engines) has been up for 14 days as of today. At 09:50 am, just as the server started ramping up to its daily load peak (CPU load ~= 4), some processes failed to write to the disk, and 'date > junk' from the command line just hung there. I canceled that 'date > junk'. Everything was back to normal in less than 4 minutes. There was nothing interesting (warn/error/abort) in the system log, the exportlog from the PERC controller, or the database log. PR (Patrol Read) was running at the time.
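The 'date > junk' check can be repeated on a schedule so that a future stall leaves a timestamped record. Below is a minimal sketch, not what I actually ran: the names timed_write and is_stall and the 5-second threshold are my own assumptions. Note that if the disk blocks, the probe blocks too, just like 'date > junk' did; the elapsed time only becomes known once the write returns.

```python
import os
import tempfile
import time


def timed_write(directory, payload=b"probe\n"):
    """Write a small temp file, fsync it, remove it, and return elapsed seconds."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory, prefix="probe-")
    try:
        os.write(fd, payload)
        os.fsync(fd)  # push the data to the device, not just the page cache
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start


def is_stall(elapsed, threshold=5.0):
    """Hypothetical rule: any probe slower than `threshold` seconds is a stall."""
    return elapsed > threshold
```

Calling timed_write on a directory that lives on the PERC volume once a minute (from cron, for example) and logging the result next to the current time would show how long each pause lasts and whether it lines up with PR activity.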

The symptoms definitely differ, so the BIOS and firmware upgrades did change things for the better. In the previous two lockups, the only two in 15 months, we lost access to the disks completely, getting "reject i/o to offlined disk" without a kernel panic or corruption. This time it was merely a hiccup, a pause, or a suspension of sorts.

Older postings on similar topics on the dell-linux-poweredge forum suggested that PR could be the culprit if the BIOS and firmware are up to date. On the system, I get the following output from 'megapr -dispPR -a0' today. Is #Iterations the current count of how many times PR has run, or a threshold of some sort? If the former, how do I clear it? If the latter, how do I increase it? Basically, I am looking into why it locked up after exactly 30 days (it could be a coincidence too, and we are now using newer BIOS and firmware). The Dell diagnostics from OMSA 4.4 on 10/17/2006 suggest nothing is wrong with the controller, memory, or underlying disks. (The omreport output for the controller is appended below too.)

********PR INFO********
Mode :AUTO
#Iterations:2200
Status :PR In Progress

# omreport storage controller
Controller PERC 4e/Di (Embedded)

Controllers
ID : 0
Status : Ok
Name : PERC 4e/Di
Slot ID : Embedded
State : Ready
Firmware Version : 522A
Driver Version : Not Applicable
Minimum Required Firmware Version : Not Applicable
Minimum Required Driver Version : Not Applicable
Number of Channels : 2
Rebuild Rate : 30%
Alarm State : Not Applicable
Cluster Mode : Not Applicable
SCSI Initiator ID : 7
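
To see whether #Iterations moves over time, or whether stalls coincide with "PR In Progress", the output above could be captured periodically and parsed. This is a minimal sketch that only parses text in the "Key : Value" layout pasted above; parse_kv is a hypothetical helper name, and the sample is copied from the megapr output in this post rather than read from the tool.

```python
def parse_kv(text):
    """Parse 'Key : Value' lines into a dict; lines without a colon are skipped."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # banner lines such as '********PR INFO********'
        result[key.strip()] = value.strip()
    return result


# Sample text copied from the megapr output pasted in this post.
sample = """********PR INFO********
Mode :AUTO
#Iterations:2200
Status :PR In Progress
"""

info = parse_kv(sample)
iterations = int(info["#Iterations"])
pr_running = info["Status"] == "PR In Progress"
```

Logging iterations and pr_running alongside a timestamp would at least show whether the counter keeps growing, which is what my first question above is trying to find out.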

Also, we upgraded the BIOS from A00 to A01 instead of to the latest A04, since the release notes for A02 through A04 didn't seem pertinent at the time. On a second read of A03's release notes, I noticed the following two fixes that could be relevant to the system. Where can I find more detailed notes than PE6850-BIOSA03.TXT? I don't quite understand why the developers or release managers are so sparing with words.

  • Added support for Virtualization Technology in the processor.
Should I assume this refers not to HT, but to hardware-assisted server virtualization such as Intel's VT (?) technology or the like?
  • Added support for 800MHz system configurations.
Does this mean that BIOS versions prior to A03 don't support 800MHz system configurations?

Also, the megaraid* driver is dated early 2005, and the ChangeLog.megaraid in /kernel/Documentation doesn't list any interesting changes either.

Tuesday, October 31, 2006

Fedora Core 6 security improvement :: refined SELinux policy management and more

After a stock installation of Fedora Core 6, I was pleasantly surprised at first boot by the SELinux policy management GUI included as part of the 'firstboot' program. You can expand or collapse the entries to set granular policies for the services you care about. For each service, all you need to do is check or uncheck policy items. Such a nice management GUI should go a long way toward unleashing the raw power of SELinux for system administrators, or even security administrators, who would otherwise have to spend much more time studying, researching, and managing SELinux policy effectively.

The same SELinux policy management tool is readily accessible under System -> Administration -> Security Level and Firewall. I guess the name of the applet has not really caught up with this new integration.

Of course, a system won't be secure unless it is kept up to date with security patches and updates. In that regard, FC6 is improved as well. The update notification applet now makes the update process point-and-click. The Package Updater runs entirely in the GUI to help regular desktop users keep their systems up to date.