Rackspace Went Down Yesterday

Browsing TechCrunch this morning, I found out that Rackspace had a complete outage yesterday at one of its datacenters. For those who do not know, Rackspace is one of the premium managed dedicated server providers in the world, with nine global datacenters.

Looking into it further, I found that the Rackspace blog starts off with this post about the developing problem on Monday:

We have experienced an interruption in power to a portion of our Dallas-Fort Worth data center facility. Power has been restored and our DC engineers are working on devices that need to be manually brought back online as quickly as possible….

Apparently, they had to switch to generator power for a period of time, but there was still a significant amount of downtime (I am not sure exactly how much, though I am hearing reports of at least an hour). This raises the question: how redundant is backup generator power? Is there always an interruption when switching to backup generators? Should there be?

There is a series of other posts updating visitors on the status, with the last one explaining:

We don’t have a lot of details on exactly what happened yet.  When we have an outage, our first focus is on fixing it and getting customers online as soon as possible.  Now that we have the near-term situation stabilized in Dallas, we have some work to do to improve our reliability.  We will follow up with more information as we work through our root-cause analysis.

What’s scary is that comments on the post seem to indicate that the DFW datacenter had been having problems over the last few weeks.

I would say that this is a major event in the managed hosting community. Hopefully it will make other companies take a hard look at their power systems and how redundant they really are.

Follow Up (7/1/2009): TechCrunch published more information yesterday:

The breaker on the primary utility feeder tripped, initiating a sequence of events that ultimately caused a power interruption in Phase I and Phase II of the data center. All systems initially came up on generator power without customer impact. The ‘A’ bank of generators, which support UPS clusters A and B in Phase I and UPS cluster E in Phase II, then experienced excitation failure which escalated to the point where the generators were no longer able to maintain the electrical load. Rackspace then attempted to switch to our secondary utility feeder, but was unable to do so due to an issue in the Pad Mounted Switch (PMS). At approximately 3:15pm CDT, power supply through UPS clusters A, B and E was lost when the batteries in those clusters discharged, and equipment receiving power through those clusters experienced an interruption in service.
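
To make that chain of failures easier to follow, here is a minimal Python sketch of the fallback order as I read it from the statement above. The names (`PowerSource`, `select_source`, `replay`) and the strict priority ordering are my own assumptions for illustration, not anything Rackspace published, and the model is deliberately simplified: it treats the UPS batteries as just another source in the list rather than as inline equipment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PowerSource:
    name: str
    available: bool


def select_source(sources: List[PowerSource]) -> Optional[PowerSource]:
    """Return the first available source in priority order, or None."""
    for source in sources:
        if source.available:
            return source
    return None


def incident_chain() -> List[PowerSource]:
    # Hypothetical model loosely based on the quoted statement.
    return [
        PowerSource("primary utility feeder", True),
        PowerSource("generator bank A", True),
        # The Pad Mounted Switch issue meant this source could not be used.
        PowerSource("secondary utility feeder", False),
        PowerSource("UPS batteries", True),
    ]


def replay(sources: List[PowerSource], failures: List[str]) -> List[str]:
    """Mark each named source as failed in turn and record what carries the load."""
    by_name = {s.name: s for s in sources}
    timeline = []
    for failed in failures:
        by_name[failed].available = False
        active = select_source(sources)
        timeline.append(active.name if active else "OUTAGE")
    return timeline


if __name__ == "__main__":
    events = ["primary utility feeder", "generator bank A", "UPS batteries"]
    for step in replay(incident_chain(), events):
        print(step)
```

Replaying the three failures in order shows the load moving to the generators, then to the batteries, then to nothing. The point of the sketch is that redundancy only helps if each fallback is actually usable at the moment it is needed; here the secondary feeder existed but could not be switched in.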