
Friday, November 10, 2006

perc 4e/Di on Dell PE6850 saga continues...part C

With the load from the full database dump plus the application load burst we set up last Friday, the problematic server 'syb04' generated a few alerts over the weekend. The alerts complained that the stamp didn't show up right after the 'logger' call. We were excited, thinking we had reproduced the problem this quickly; the next step would simply be to pick what to upgrade from a decent list of potential upgrades.

Examining the local log as well as the log on the remote syslogd server closely, however, showed that the 'missing' stamps actually appeared right after the complaints of their absence. Most were within the same second, and only one was a second late. The alerts were therefore identified as false positives.
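
For context, the check behind these alerts boils down to something like the following. This is just an illustrative sketch, not the exact production script; the tag, log path, grace period, and alert address are made up here.

#!/bin/sh
# Sketch of the stamp check (illustrative; the real script, tag and log path differ).
STAMP="stampcheck-$(date +%s)"      # unique marker for this run
LOG=/var/log/messages               # assumed local syslog destination

logger -t stampcheck "$STAMP"       # write the stamp through syslogd

# Allow a short grace period before declaring the stamp missing,
# since the false positives above were off by at most a second.
sleep 2

if grep -q "$STAMP" "$LOG"; then
    echo "OK: stamp $STAMP found in $LOG"
else
    echo "ALERT: stamp $STAMP missing from $LOG" | mail -s "syb04: stamp missing" oncall@example.com
fi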

Needless to say, we were disappointed. Even worse, this remained the case for the full week. We started to toss around the idea that maybe the hiccup was merely a delay and we had overreacted a little by flipping the switch.

To "add insult to the injury", our constant attention was demanded by a lot of database problems related to application peak load which was coerced to repeat. The problems were:
  • Sybase ASE log device filled up, causing the application peak load come to a sudden halt, until Sybase is restarted with log cleared.
  • Hourly transaction has grown from 20M each to over 1G each. It seemed like some transaction failed to be committed.
  • In turn the transaction dumps filled up the disk.
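
Instead of bouncing Sybase each time, the log could have been cleared from cron. A rough sketch only: the server name, database name, and password file below are placeholders, and 'with truncate_only' throws the log away rather than saving it to a dump device.

#!/bin/sh
# Illustrative only: truncate the ASE transaction log without restarting the server.
SRV=SYB04                                  # placeholder server name
DB=appdb                                   # placeholder database name
isql -S "$SRV" -U sa -P "$(cat /etc/sybase/.sapw)" <<EOF
dump transaction $DB with truncate_only
go
EOF
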
So far, the server has endured seven 24-hour days of load runs, which totals 24x7x3 = 504 load peaks. A regular day has only 2 peaks, so this equals roughly 252 days' worth of load.

I am betting a small sum of money on PR (patrol read), whose background scheduling may be caught off guard by the sudden spike in disk I/O from the nightly full database backup as well as the daily application peak. To force PR to collide with the load, I wrote a script that checks PR status and starts one if none is already 'In Progress', as reported by 'megapr -dispPR -a0'.
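
Roughly, the script looks like the sketch below (simplified; the megapr path is illustrative, and the '-startPR' flag is from memory, so verify it against megapr's own usage output before relying on it):

#!/bin/sh
# Simplified sketch: force a patrol read to overlap with the load windows.
# Run from cron shortly before the nightly backup and the application peaks.
MEGAPR=/usr/local/sbin/megapr              # illustrative path to the utility

if $MEGAPR -dispPR -a0 | grep -q 'In Progress'; then
    echo "`date`: patrol read already running on adapter 0"
else
    echo "`date`: no patrol read in progress, starting one"
    $MEGAPR -startPR -a0                   # assumed start flag; confirm first
fi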

BTW, the 'megapr -dispPR -a0' command alone causes the following errors in the PERC controller's exportlog. My inquiry about these errors got no response from Dell's linux-poweredge list, which is monitored by a few Dell engineers.
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[6] pRfm a07517c0
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[16] pRfm a0743360
11/07 10:25:51: GET: SCSI_chn=ff, rtn status=0

Wednesday, November 01, 2006

perc 4e/Di on Dell PE6850 saga continues...part A

We ended up applying a BIOS upgrade (A00 -> A01) and a PERC 4e/Di firmware upgrade (521A to 522A, A13) for the system lockup problems we had on the production database server, a Dell PE6850. Home-made load tests didn't cause a panic for 18 hours, and the server was then rushed back into production since the fail-over spare couldn't stand the load.

The server (i.e. the Sybase database engines) has been up for 14 days as of today. At 09:50am, just as the server started to ramp up to its daily load peak (CPU load ~= 4), some processes failed to write to disk, and a 'date > junk' from the command line just hung there. I cancelled that 'date > junk', and all was good again after less than 4 minutes. Nothing interesting (warn/error/abort) showed up in the system log, the exportlog from the PERC controller, or the database log. PR was running at the time.

The symptoms definitely differ, so the BIOS and firmware upgrades did change things for the better. In the previous two lockups, the only two in 15 months, we lost access to the disks entirely, getting "rejecting I/O to offline device" without any kernel panic or corruption. This time it was merely a hiccup, a pause or suspension of sorts.

Older postings on similar topics on the dell-linux-poweredge list suggested PR could be the culprit if the BIOS/firmware is up to date. On the system, I got the following output from 'megapr -dispPR -a0' today. Is #Iterations the running count of patrol reads completed so far, or a threshold of some sort? If the former, how do I clear it? If the latter, how do I increase it? Basically, I am looking into why the box locked up exactly 30 days apart (it could be a coincidence too, and we are now on newer BIOS and firmware). Dell diag from OMSA 4.4 on 10/17/2006 found nothing wrong with the controller, memory, or underlying disks (omreport output for the controller is appended below).

********PR INFO********
Mode :AUTO
#Iterations:2200
Status :PR In Progress

# omreport storage controller
Controller PERC 4e/Di (Embedded)

Controllers
ID : 0
Status : Ok
Name : PERC 4e/Di
Slot ID : Embedded
State : Ready
Firmware Version : 522A
Driver Version : Not Applicable
Minimum Required Firmware Version : Not Applicable
Minimum Required Driver Version : Not Applicable
Number of Channels : 2
Rebuild Rate : 30%
Alarm State : Not Applicable
Cluster Mode : Not Applicable
SCSI Initiator ID : 7

Also, we upgraded the BIOS from A00 to A01 instead of to the latest A04, since the release notes for A02 through A04 didn't read as pertinent at the time. On a second read of A03's release notes, I noticed the following two fixes that could be relevant to the system. Where can I find more detailed notes than PE6850-BIOSA03.TXT? I don't quite understand why the developers or release managers mince words so much.

  • Added support for Virtualization Technology in the processor.
Should I assume this is not referring to HT, but to server virtualization assistance from Intel's VT technology or the like?
  • Added support for 800MHz system configurations.
Does this mean BIOS versions prior to A03 don't support 800MHz system configurations?

The megaraid* driver, though, is dated early 2005, and the ChangeLog.megaraid under the kernel's Documentation directory doesn't list many interesting changes either.

Wednesday, October 18, 2006

PERC 4e/di went offline on a Dell PowerEdge 6850, with nothing apparently bad

I got paged at 2:15 in the morning, only to find that the main RAID controller on a Dell PowerEdge 6850 had yanked itself out from under the running OS (CentOS 4.1/i386), again. The OS seemed to be fine, except that any task involving reads or writes to disk failed.

The PE6850 server was bought new from Dell last June and has been in production since. On the same day, at the same time, last month, it gave us its first outage with the same symptoms. The questions we asked ourselves: why start now? And why exactly one month apart?
  • The serial console showed the following message repeating on the screen:
scsi0 (0:0): rejecting I/O to offline device
EXT3-fs error (device sda2): ext3_find_entry: reading directory #178817 offset 0
  • The nightly full database backup kicked off at 2:00 and should finish by 2:45am.
  • Log exported from the PERC 4e/di controller showed that PR (patrol read) is scheduled to run every four hours and one started at 1:00 and didn't get to finish either.
  • The same RAID controller log also has two interesting entries that coincided with the two outages. These are the only occurrences of "ProcessHostDmaInterrupt: No requests active" in a log stretching back to last June. The DMA angle makes BIOS update A01 interesting, in that it corrects memory address assignment, or claims, by the RAID controller (the server has 16GB of DDR2 ECC RAM, in 2GB modules).
09/18 2:05:47: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
09/18 2:05:47: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
10/18 2:13:34: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
10/18 2:13:34: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
  • Some postings suggest that turning off PR may help in case it conflicts with heavy disk I/O from application tasks. However, at the time of both outages, the disk I/O and CPU load were not at their highest; we see much higher numbers during the day every workday, per the baselines neatly graphed with RRDtool by our new Hobbit monitor.
  • Dell Diag was run and reported nothing wrong with the memory, disks, RAID controller, and such.
Checking Dell's Driver download site with the server's service tag turned up a few interesting potential fixes:
  • LSI Logic Perc 4e/Di, v.522A, A13 [[We are at 521S, A00]] Release Date: 9/11/2006
    • "Fixed an issue that could cause a blue screen, file system error or system hang when using EVPD inquiry commands."
    • Does anybody have any idea what exactly "an issue" is?!
  • BIOS upgrade A01 [[ we are at A00 ]]
    • Added workaround for lockup resulting from the systems with 8GB RAM or more and RAID storage controller potentially claiming inappropriate addresses.
  • SCSI controller firmware update JT00 for various Maxtor drives
    • "Under certain circumstances a hard disk drive may go offline, hard disk drives (HDD), may report offline due to a timeout condition. If the HDD is unable to complete commands, this may result in the controller reporting the HDD off line due to a timeout condition."
    • "Higher than expected failures rates have been reported on the Maxtor Atlas 10K V ULD (unleaded or lead free) SCSI hard disk drives. If the hard disk drive (HDD) is unable to complete commands, this may result in the controller reporting the HDD offline due to the timeout condition. The primary failure modes have been the HDD failing to successfully rebuild and also failing after a rebuild has completed."
    • Sounds promising, but it turned out that the drives in our PE6850 are all Seagate.