Friday, November 10, 2006

perc 4e/Di on Dell PE6850 saga continues...part C

With the load from the full database dump plus the application load burst we set up last Friday , the problematic server 'syb04' generated a few alerts over the weekend. The alerts complained the stamp didn't show up right after the 'logger' call. We were very excited, thinking we were able to reproduce the problem this quick. The next thing would be just to pick out what to upgrade from a decent list of potential upgrades.

Examining closely the local log as well as the log on the remote syslogd server, however, showed that such 'missing' stamps appeared up right after the complaints of their absence. Most were within the same second, and only one was one second late. Therefore such alerts were identified as false positives.

Needless to say, we were very disappointed. Even worse, this remained the case for the full week. We started to toss around the idea maybe the hiccup was merely a delay and we overacted a little by flipping the switch.

To "add insult to the injury", our constant attention was demanded by a lot of database problems related to application peak load which was coerced to repeat. The problems were:
  • Sybase ASE log device filled up, causing the application peak load come to a sudden halt, until Sybase is restarted with log cleared.
  • Hourly transaction has grown from 20M each to over 1G each. It seemed like some transaction failed to be committed.
  • In turn the transaction dumps filled up the disk.
So far, the server endured seven 24-hour days of load run, which totals 24x7x3= $504 load peaks. A regular day have only 2 peaks, therefore, this equals to 250 days worth of load.

I am betting a small sum of money on the PR (patrol read), whose background scheduling may be surprised by the sudden spike in disk IO caused by the nightly full database backup as well as the daily application peak. To force PR to collide with the load, I wrote a script to check PR status and start one if none is 'In Progress' already, as reported by 'megapr -dispPR -a0'.

BTW, the 'megapr -dispPR -a0' command alone causes the following errors in PERC controller's exportlog. My inquiry on this error got no response from Dell's linux-PowerEdge forum, which is monitored by a few Dell engineers.
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[6] pRfm a07517c0
11/07 10:25:51: MPT_Rec: INQ Error - Negotiating LD[16] pRfm a0743360
11/07 10:25:51: GET: SCSI_chn=ff, rtn status=0


Barry said...

if its under warranty, replace the controller.

jackOfAllTrades said...

The problem is no diagnostic utilities showing anything wrong with any hardware component. Dell Tech support took the position that CentOS is unsupported (even those CentOS 4 is a certifiable twin of RHEL 4 AS) and Dell Diag reports nothing wrong with hardware, so nothing, zip!

IT-Dude said...

What ever became of this for you? We received similiar log messages and are wondering what they mean. We also see the same message as you and also things in our like: "Rejecting MISC opcode: unknown sub-opcode (0x26)" and "Rejecting Unknown FC DCMD 1f"