March 06, 2007

On power and cooling in data centers - part II

But that isn't the end of my tales on power and cooling in data centers. Now with our proud new cluster in place, we were in business. Things purred along nicely for about six months. Then in June 2006, I was sitting in the data center around noon one day with some summer college interns. The data center manager was on vacation and everyone else was out to lunch - literally. Suddenly I noticed a flickering in the light fixtures. The flickering ran up and down the entire length of the computer room floor. It stopped, then repeated about a minute later. A feeling of dread came over me. I have worked in data centers for many years and have been through many electrical problems and power outages, but I had never seen anything like that before.

Moments later the alarms went off on the UPSes and the batteries started draining. I fired off an emergency email to all IS systems personnel summoning them to the data center immediately. That was in fact the biggest problem - convincing your coworkers that there is a problem when they aren't there to see it. My servers? They're still up! What the **** are you worried about?

And so it was. I contacted my boss and he told me to start shutting down our stuff. Meanwhile, after about 10 minutes, I heard a deep-sounding collective POOF that echoed across the data center. I looked on in horror as I saw that all of the air handlers had lost power simultaneously. I re-sent my emails to everyone to get them into the data center to start some kind of orderly shutdown, considering that we now had an extremely short time frame in which to attain one.

We were eventually able to get things back together. We had blown some fuses on our air handlers and had lost a phase of electrical power. We turned the data center back on after some hours and things ran over the weekend. Then we ran into the same problem over the weekend and again during the next week.

Eventually this led to meetings and pow-wows. The local utility basically said that we were pushing the limits of our allocated line feeds, and we in turn queried the electrical contractor who had filed the permits for the electrical work on our cluster expansions. It seems the contractor had rushed the process and had assured us that the necessary power would be there when in fact there were no such assurances. We eventually had to schedule another shutdown as part of a shutdown of the whole office complex (we rent space in a major office complex which has three buildings in it). Then the power company had to add the needed infrastructure to give us an increase from 1,600 amps to 2,500 amps (at 480-volt, three-phase power).
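
For a rough sense of what that upgrade means, here is a back-of-the-envelope sketch. It assumes the 480 volts is the line-to-line voltage and that the load is balanced across the three phases, neither of which is spelled out above, and it gives apparent power (kVA) at the full rated current, not the real power we could actually draw. The function name is just one I made up for illustration.

```python
import math


def three_phase_kva(line_voltage_v: float, line_current_a: float) -> float:
    """Apparent power in kVA for a balanced three-phase feed.

    Assumes line_voltage_v is the line-to-line voltage (an assumption,
    not something stated in the post).
    """
    return math.sqrt(3) * line_voltage_v * line_current_a / 1000.0


# Feed ratings quoted in the post: 1,600 A before, 2,500 A after, at 480 V.
before_kva = three_phase_kva(480, 1600)
after_kva = three_phase_kva(480, 2500)

print(f"before: {before_kva:.0f} kVA")
print(f"after:  {after_kva:.0f} kVA")
print(f"increase: {(after_kva / before_kva - 1) * 100:.0f}%")
```

Under those assumptions the feed goes from roughly 1,330 kVA to roughly 2,080 kVA, a bit over half again as much capacity. In practice you would not plan to run a feed at its full rating, so treat these as ceilings rather than usable numbers.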

And so it was. What lessons were there to be had in all of this?

1) As your power and cooling demands grow, the potential scope for **** ups will also grow. At first it was only blown power strips at the rack level. Eventually it was overheating the entire data center and overloading the electrical line feeds into the building. You need to do your homework in advance to avoid these problems.

2) Study these issues. Become at least mildly familiar with electrical power and cooling terms, concepts and issues.

Enough for now.

Posted by The Mighty Wizard at March 6, 2007 02:04 AM