A power outage around 1PM yesterday in the San Francisco area hit the data center at 365 Main, home to Craigslist among others. The outage lasted some 45 minutes before the backup generators kicked in, but when power was restored the Craigslist staff found they had corrupted databases to rebuild. That work lasted until at least dawn. In a post responding to user inquiries about the outage and the lack of backup power, a staffer writes:
Our colo charges us a serious amount of money to provide continuous uninterrupted power 24/7/365, even in the event of a blackout. Heck, they even say that their building is earthquake proof. They have huge backup generators, with two levels of failsafes, which they test monthly. Of course, during the power outage in downtown SF yesterday, these highly touted super backup generators failed to kick in.
Lots of big sites that share the facility with us were also down during this time, including LiveJournal, Alexa, Sun and CNET. Unfortunately for us, our DBs did not like having the power suddenly cut mid-write (mentally multiply these writes by the 1000s of people who are simultaneously posting). Once power was restored, and the DBs were brought back up, they were all corrupted, and had to be rebuilt.
Our system administrators spent over 12 hours working continuously on this problem, some of them working past dawn. They deserve some serious kudos for getting the site back with almost no data loss, in a relatively small (considering the circumstances) amount of time.
Our “guaranteed power” colo has some serious explaining to do. To us, and plenty of other companies who pay them so that incidents like yesterday don’t ever happen.
All of which to answer your original question: We do try to have our website available continuously, and actually shell out the money to host our machines at one of San Francisco’s top facilities. Obviously, that’s not good enough, and we already have plans in place to improve this situation.
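To make the failure mode concrete: a write that loses power halfway through leaves data that is neither the old record nor the new one, which is exactly why the DBs came back corrupted. One classic defense (databases use write-ahead logs for the same reason) is to never overwrite in place: write a temporary copy, fsync it, then atomically rename it over the original, so a crash at any moment leaves either the complete old file or the complete new one. Here’s a minimal sketch of that pattern in Python; the function and paths are mine for illustration, not anything Craigslist actually runs:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data so a crash mid-write leaves either the old file or the new one.

    A plain open(path, 'wb').write(data) can be cut off halfway -- the same
    failure mode as a power outage mid-write -- leaving a corrupt file.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Stage the new contents in a temp file in the same directory, so the
    # final rename stays on one filesystem (where rename is atomic).
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the swap
        os.replace(tmp, path)  # atomic: readers see old or new, never half
    except BaseException:
        os.unlink(tmp)  # clean up the staging file if anything failed
        raise
```

A fully durable version would also fsync the containing directory after the rename; the point here is just that atomicity, not the raw write, is what survives a sudden power cut.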
Stories like this keep me excited about utility computing. Building fault tolerance into systems is incredibly difficult and expensive. The Craigslist team spent extra money and manpower doing everything by the book to avoid downtime, but still ended up with a significant outage. Ultimately, only running redundant live copies of an app ensures availability, but doing that has been horribly expensive and difficult. Utility computing changes that.
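To make “redundant live copies” concrete, here’s a toy failover sketch in Python. The replica URLs are hypothetical stand-ins for the same app hosted in two independent facilities; a real deployment also needs replicated state, health checks, and DNS or load-balancer support, which is precisely the expensive part utility computing promises to commoditize:

```python
import urllib.request
import urllib.error

# Hypothetical replicas of the same app in two independent facilities.
REPLICAS = [
    "https://sf.example.com/posting/123",
    "https://east.example.com/posting/123",
]

def fetch_with_failover(urls, timeout=3):
    """Try each replica in turn; one datacenter outage costs a retry, not downtime."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # this replica is down; fall through to the next
    raise RuntimeError(f"all replicas failed: {last_error}")
```

With a second live copy, the 365 Main outage would have meant a slow request or two instead of a night spent rebuilding databases.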