Posted on June 5, 2008
About 7000 people got a healthy dose of Murphy’s Law when The Planet’s H1 data center exploded this week. Whatever can go wrong will go wrong. Whatever can’t go wrong will probably go wrong too.
For those of you who don’t follow the hosting industry: an explosion in The Planet’s electrical room knocked out three walls and damaged a substantial amount of their power distribution gear beyond repair. The automatic transfer switches that allow the generators to kick in were also damaged. The exact cause of the explosion is still under investigation; all we know is that a conduit exploded.
The bizarre chain of events that followed should be sending us all back to the drawing board to re-write our disaster recovery plans.
The best designed facility in the world with double redundant everything is not going to help much if the local fire department does not give you clearance to power up the generators for hours after an incident. Meanwhile, the facility will begin to get rather warm. Hold that thought for some personal reflections:
I once had to clean up a really big mess (I am an electrician). I was called in to a hospital where a sister company of ours had a guy working on some switchgear. He dropped his ratchet across some live 1200 amp copper bars, which sent him to intensive care missing 2/3 of his skin and half of his face. I had elevators, patient care machines, operating rooms, everything ... all down. The explosion knocked out 1/3 of the switchgear and the transfer switches.
72 hours later, we had a very big generator outside re-feeding the building while we worked to replace everything that was damaged. The generator actually arrived 3 hours after the blast, but we could not re-feed power until we had ripped out and replaced a ton of feeders; otherwise it would have just exploded again.
You cannot understand how difficult it is to work in a room with 30 sweaty electricians and the smell of burning flesh in the air until you’ve done it. I can’t imagine that things at The Planet were any easier, except it probably smelled better. It took us a total of 2 weeks (in hours) to finally put everything back together the way that it was.
About 50 hours after the explosion, The Planet began the process of restoring power (from generators) and cooling the data center. Lo and behold, one of the generators had a bad breaker. They went from being back in business down to only one floor having power. Finding someone with replacement parts for a 2 megawatt generator in the wee hours of the morning is no easy task.
Amazingly, 80 hours after the initial blast, all customers were restored. While this may seem like an eternity to some, it’s a monumental feat to others (who have actually cleaned up this kind of mess). All in all, The Planet did an excellent job; I cannot see any single way that they could have done things better.
What this presents is a very interesting case study in disaster recovery. This is truly the worst-case scenario, and it needs to be studied and picked apart. The resulting research will write the book on how to buy foreign facilities, test them and assimilate them into your disaster recovery plan.
An eight-hour scheduled outage and one guy with a megohmmeter could have predicted this, and the same could have prevented it. Clients remain oblivious to such tests, running on generator power all the while. These kinds of tests need to be put into place; simple thermal imaging doesn’t tell you what’s going to get hot, only what is hot.
I really hope that The Planet releases all details of the event; at least that way it can do some good. What is most interesting is this: how good is any plan when you aren’t allowed to implement it due to the circumstances surrounding the disaster?
Food for thought.