Saturday, November 04, 2006

PERC 4e/Di on Dell PE6850 saga continues... part B

After searching high and low, I compiled a decent list of potential upgrades and toggles to try out on syb04. None of them seems pertinent enough to make you say 'a-ha'. I also purchased and put into production a new server named syb06, the one that was killed by oom-killer and cured by kernel-hugemem. With the improved production configuration, we can afford more time than we could when syb04 last locked up. So our team of 'experts' decided to reliably reproduce either the recent hiccup or the older lockup problem before attempting a fix this time around.

Hobbit Monitor is watching all the servers, so it is rather easy to catch the old lockup problem, in which all checks went to 'purple', meaning 'stale' or 'no report received' status. Detecting a hiccup is trickier. If it starts and ends between two consecutive checks of the 5-minute interval Hobbit Monitor uses, we'd miss the signal! It seems it is not all that easy to bring the monitoring frequency down to 1 minute for a single client, as nobody has answered my question on the Hobbit mailing list for three days now. After much discussion of alternatives, I came up with a way and verified that it works.
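To make the sampling problem concrete, here is a minimal Python sketch of why a short hiccup can slip between two checks. It is a simplification and not how Hobbit Monitor works internally: it assumes each check sees only the state at the instant it runs, and the function name and parameters are made up for illustration.

```python
import math

def hiccup_is_sampled(start_s, duration_s, interval_s=300, first_poll_s=0):
    """Return True if at least one poll lands inside the hiccup window.

    Assumption (hypothetical model): polls happen at first_poll_s,
    first_poll_s + interval_s, ... and each poll only sees the state
    of the server at that exact instant.
    """
    end_s = start_s + duration_s
    # Index of the first poll at or after the hiccup starts.
    k = max(math.ceil((start_s - first_poll_s) / interval_s), 0)
    next_poll_s = first_poll_s + k * interval_s
    # The hiccup is seen only if that poll happens before it ends.
    return next_poll_s <= end_s

# A 2-minute hiccup starting 10 seconds after a 5-minute check is missed...
print(hiccup_is_sampled(310, 120, interval_s=300))  # False
# ...but the same hiccup is caught with a 1-minute interval.
print(hiccup_is_sampled(310, 120, interval_s=60))   # True
```

Under this model, any hiccup shorter than the interval can be missed depending on when it starts, which is why the 1-minute interval matters for syb04.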

With the monitor fine-tuned and focused on syb04, the next step was to add load to it. Counting the full nightly database backup and the daily peak as two load situations per day, we need at least 28 peaks to match the 14 days leading up to the lockup. The nightly database backup takes only 25 minutes, so it is very easy to run continuously by simply changing the cron schedule to every 30 minutes instead of once a day. So we did that. After 20 hours (about 40 load peaks), nothing happened. Since we don't plan to work over the weekend, we decided to simulate the daily load peak as well and let it run continuously. That took some Java code changes, and it is done. So we'll have both the application load and the backup load against the server over the weekend.
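The peak counting above is simple enough to check with a few lines of Python. This is only a sketch of the arithmetic; the constant and function names are made up for illustration.

```python
PEAKS_PER_DAY = 2        # nightly backup + daily application peak
DAYS_BEFORE_LOCKUP = 14  # uptime leading up to the lockup

def peaks_needed(days=DAYS_BEFORE_LOCKUP, per_day=PEAKS_PER_DAY):
    """Load peaks the server saw before it locked up."""
    return days * per_day

def backup_runs(hours, every_minutes=30):
    """Backups started in the given number of hours at a fixed schedule."""
    return (hours * 60) // every_minutes

print(peaks_needed())   # 28
print(backup_runs(20))  # 40
```

In a typical crontab, an every-30-minutes schedule is written with `*/30` in the minute field and `*` in the other four fields. Each 25-minute backup then leaves about 5 minutes of slack before the next run starts.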

* fingers-crossed *
