Hello everyone. It has been a rough couple of days for all of us. It's not an easy job taking responsibility for hundreds of online businesses. Usually it is terrific and sometimes we pay our dues. This is one of those "bumps in the road" that are never fun but definitely can build character.
You should know that we replace and repair servers all the time with ZERO downtime. This particular issue took some unexpected turns.
For those that are interested, I have a pretty detailed explanation and some answers to a few of your questions.
The technical details:
Wednesday: For unknown reasons our primary database server rebooted and recovered on its own. We decided at that time to let it recover instead of diverting to the backup database server. (If we use the backup server, the two servers must be resynchronized later, and that causes a slowdown in performance, so we avoid it if we can.)
Thursday: In the morning the same thing happened. So we immediately cut over to the backup database server and took down the primary server to diagnose the hardware. All servers and email were up and running.
When the primary server rebooted and rejoined the cluster, the SCSI channel card, for reasons still unexplained, rearranged its physical channels and mounted its partitions incorrectly, causing a myriad of issues too long to explain here. The entire cluster was disabled while we corrected this issue. Why this happened is still under investigation. None of the IT professionals we have discussed this with so far can explain why the hardware misbehaved in this fashion. It could not have been predicted or avoided. So the downtime was primarily caused by this unexpected issue.
Once recovered, the primary database server had to synchronize with the backup so that both would contain the most current database transactions. This is what caused the speed to decrease for about 12 hours.
Friday: The database servers finished synchronizing. The primary database server was put back into the cluster and things were reported to be 100% again.
UNFORTUNATELY the primary server crashed again with the exact symptom it had on Wednesday and Thursday. We again replaced the hardware without any downtime (the way it should have happened the first time) and began to troubleshoot software issues that might be the cause.
We had to restart the server hardware and software several times throughout the day, causing very short but annoying email and web site interruptions. Near the end of the day we determined the issue was due to a Linux kernel bug. I won't bore you with the details, but the team was able to mitigate the issue.
We also performed a database server upgrade on the backup server and will complete the upgrade on the primary database server at 3:00 am Saturday. This will result in a very short downtime and a server reboot.
Currently everything is running at 100% and at normal speed.
Answers to common questions:
Question: I thought redundant systems never fail. Why isn't the system running 100% of the time?
Answer: It sounds great, but the fact is that nobody can create a system that is up 100% of the time. No matter how much money or time you have, these things can and do happen. I did some research and found that a similar issue happened to Google's Gmail system recently. In fact, over the last few years most large companies have had a similar story (even Oracle systems, which purport to "never fail"). I'll add that the tradeoff for reliability is speed. You can't have both.
Question: What have you done to make sure this doesn't happen in the future?
Answer: We have written scripts for the servers to detect any changes in the SCSI addressing upon boot. This way if the drives mount incorrectly, the server services will not start. This will avoid the same issue happening in the future. Once we get word from the hardware manufacturer we may make additional adjustments.
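A check of this kind can be sketched as follows. This is only an illustration in Python, not the actual production script: the baseline file path, the function names, and the choice to compare devices against mount points in the /proc/mounts format are all assumptions made for the example.

```python
import json
import sys


def parse_mounts(text):
    """Map each /dev/* device to its mount point from /proc/mounts-style text."""
    layout = {}
    for line in text.splitlines():
        fields = line.split()
        # /proc/mounts lines look like: device mountpoint fstype options dump pass
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            layout[fields[0]] = fields[1]
    return layout


def layout_changes(expected, actual):
    """Return readable differences between two device -> mount point maps."""
    changes = []
    for device in sorted(set(expected) | set(actual)):
        want = expected.get(device)
        have = actual.get(device)
        if want != have:
            changes.append(f"{device}: expected {want}, found {have}")
    return changes


def check_boot_layout(baseline_path, mounts_path="/proc/mounts"):
    """Return 0 if the current mounts match the saved baseline, 1 otherwise."""
    # baseline_path is a hypothetical JSON file holding a known-good
    # device -> mount point map captured while the server was healthy.
    with open(baseline_path) as f:
        expected = json.load(f)
    with open(mounts_path) as f:
        actual = parse_mounts(f.read())
    changes = layout_changes(expected, actual)
    for change in changes:
        print(change, file=sys.stderr)
    return 1 if changes else 0


# Example boot-time use (hypothetical path):
#   sys.exit(check_boot_layout("/etc/mount-baseline.json"))
```

The idea is that the boot sequence starts the database and other services only when the check exits with status 0, so a server whose drives come up in the wrong order stays out of the cluster until someone looks at it.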
Question: Why didn't your tech support team give us this detailed information?
Answer: Primarily because the team worked feverishly and expected to have it resolved much faster. The tech support guys did not have this detail. Only the server administrators and I knew the play-by-play. I apologize if any of you felt uninformed. We did try to contact and respond to anyone who had specific questions.
Please understand that the team worked non-stop until this issue was resolved. Some team members didn't sleep for the better part of 30 hours. We take ALL issues seriously and care about each and every customer. I cannot guarantee 100% uptime. However, I can guarantee that we will always work as hard as possible to provide the best possible service to our Website Forge customers.
Shane Merem
http://www.websiteforge.com/
Website Design and E-commerce