Wednesday, October 18, 2006

PERC 4e/di went offline on a Dell PowerEdge 6850, with nothing apparently wrong

I got paged at 2:15 in the morning, only to find that the main RAID controller on a Dell PowerEdge 6850 had yanked itself out from under the running OS (CentOS 4.1/i386), again. The OS seemed to be fine, except that any task involving a read or write to the disk failed.

The PE6850 server was bought new from Dell last June and has been in production since. On the same day of the month, at about the same time, last month, it gave us its first outage with the same symptoms. The questions we asked ourselves were: why did this start now? And why exactly one month apart?
  • The serial console showed the following messages repeating on the screen:
scsi0 (0:0): rejecting I/O to offline device
EXT3-fs error (device sda2): ext3_find_entry: reading directory #178817 offset 0
  • The nightly full database backup had kicked off at 2:00am and would normally have finished by 2:45am.
  • The log exported from the PERC 4e/di controller showed that PR (patrol read) is scheduled to run every four hours; one run started at 1:00 and did not get to finish either.
  • The same RAID controller log also has two interesting entries that coincide with the two outages. These are the only two "ProcessHostDmaInterrupt: No requests active" entries in a log stretching back to last June. The DMA angle makes BIOS update A01 interesting, in that it corrects memory address assignments, or claims, by the RAID controller (the server has 16G of DDR2 ECC RAM, in 2G chips):
09/18 2:05:47: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
09/18 2:05:47: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
10/18 2:13:34: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
10/18 2:13:34: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)
  • Some postings suggest that turning off PR may help in case it conflicts with heavy disk I/O from application tasks. However, at the time of both outages, neither the disk I/O nor the CPU load was at its highest; we have seen much higher levels during the day every work day, per the baselines neatly graphed using RRDTOOL by our new Hobbit Monitor.
  • We ran Dell Diag and it reported nothing wrong with the memory, disks, RAID controller and such.
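To double-check the timing by script rather than by eye, here is a minimal Python sketch that parses the four controller log lines quoted above, collapses the duplicates, and tests each event against the 2:00 to 2:45 backup window. The year is an assumption, since the exported log lines carry no year field, and the helper names (`parse_line`, `in_backup_window`) are made up for this illustration, not part of any Dell or LSI tool.

```python
import re
from datetime import datetime, time

# Log lines copied verbatim from the controller export quoted above.
LOG_LINES = [
    "09/18 2:05:47: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)",
    "09/18 2:05:47: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)",
    "10/18 2:13:34: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)",
    "10/18 2:13:34: ProcessHostDmaInterrupt: No requests active! (ch=a0102be8)",
]

# Assumption: the export has no year field, so 2006 is supplied by hand.
ASSUMED_YEAR = 2006

# Backup window from the notes above: starts at 2:00, expected to end by 2:45.
BACKUP_START = time(2, 0)
BACKUP_END = time(2, 45)

LINE_RE = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{1,2}):(\d{2}):(\d{2}): (.+)$")


def parse_line(line, year=ASSUMED_YEAR):
    """Return (timestamp, message) for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    month, day, hour, minute, second = (int(g) for g in match.groups()[:5])
    return datetime(year, month, day, hour, minute, second), match.group(6)


def in_backup_window(ts):
    """True if the timestamp falls inside the nightly backup window."""
    return BACKUP_START <= ts.time() <= BACKUP_END


# Collapse the duplicated lines into distinct events, oldest first.
parsed = (parse_line(line) for line in LOG_LINES)
events = sorted({p for p in parsed if p is not None})

for ts, message in events:
    print(ts.isoformat(sep=" "), "| in backup window:", in_backup_window(ts), "|", message)

# Gap between the first and the last event.
gap = events[-1][0] - events[0][0]
print("gap between outages:", gap)
```

The same check could be pointed at the full exported log instead of the hard-coded list, to confirm that no other entries of this kind fall outside the backup window.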
Checking Dell's Driver download site with the server's service tag turned up a few interesting potential fixes:
  • LSI Logic Perc 4e/Di, v.522A, A13 [[We are at 521S, A00]] Release Date: 9/11/2006
    • "Fixed an issue that could cause a blue screen, file system error or system hang when using EVPD inquiry commands."
    • Does anybody have any idea what exactly "an issue" is?!
  • BIOS upgrade A01 [[ we are at A00 ]]
    • Added workaround for lockup resulting from the systems with 8GB RAM or more and RAID storage controller potentially claiming inappropriate addresses.
  • SCSI controller firmware update JT00 for various Maxtor drives
    • "Under certain circumstances a hard disk drive may go offline, hard disk drives (HDD), may report offline due to a timeout condition. If the HDD is unable to complete commands, this may result in the controller reporting the HDD off line due to a timeout condition."
    • "Higher than expected failures rates have been reported on the Maxtor Atlas 10K V ULD (unleaded or lead free) SCSI hard disk drives. If the hard disk drive (HDD) is unable to complete commands, this may result in the controller reporting the HDD offline due to the timeout condition. The primary failure modes have been the HDD failing to successfully rebuild and also failing after a rebuild has completed."
    • Sounds promising, but it turned out that the drives in our PE6850 are all Seagate.
