Facebook was down for about 2.5 hours yesterday due to technical issues. With millions of users on the network, the outage caused quite a furor across the social networks.
A lot of people were dissing Facebook for the outage, but as a database administrator by day and someone who runs multiple websites, I was a lot more interested in the autopsy.
It is no secret that Facebook is constantly pushing cutting-edge browser technology to millions of users, and supporting such a massive operation with such a small workforce is nothing short of spectacular.
So, how does one cope with a scenario where the whole network is brought to its knees?
Facebook’s director of software engineering, Robert Johnson, has written a blog post detailing what happened and how they fixed it.
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
…..
This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
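The shape of the problem, as the full post describes it, is a feedback loop: when clients detect an invalid configuration value in the cache, each one goes back to the databases to fetch a replacement, so a single bad value pushed to the persistent store turns the entire fleet into a query storm. Here is a minimal sketch of that dynamic in Python; the numbers and names (DB_CAPACITY, NUM_CLIENTS, NORMAL_MISS_RATE) are made up for illustration and have nothing to do with Facebook's actual systems.

import random

DB_CAPACITY = 500        # queries per cycle the database tier can absorb (illustrative)
NUM_CLIENTS = 10_000     # application servers polling the configuration cache
NORMAL_MISS_RATE = 0.01  # fraction of clients that refresh from the DB in a healthy cycle

def db_load(bad_value_in_cache: bool) -> int:
    """Count how many clients query the database in one polling cycle."""
    queries = 0
    for _ in range(NUM_CLIENTS):
        if bad_value_in_cache:
            queries += 1  # every client sees the invalid value and tries to "fix" it
        elif random.random() < NORMAL_MISS_RATE:
            queries += 1  # routine cache refresh
    return queries

for label, bad in [("healthy config", False), ("invalid config pushed", True)]:
    load = db_load(bad)
    status = "OK" if load <= DB_CAPACITY else "OVERLOADED"
    print(f"{label:>22}: {load:6d} DB queries -> {status}")

In this toy model the "correction" logic is exactly what melts the database tier, which is why simply restoring a good value is not enough once the stampede is underway; traffic has to be stopped and let back in gradually.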
If you are the technically inclined kind, it makes for great reading. When the proverbial dung hit the fan, it is pretty amazing that they were able to bring the site back up in a mere 2.5 hours.
I have read that Facebook runs on MySQL servers, and it is pretty fascinating how things can melt down when a rogue scenario comes out of the blue and catches you unawares.
{ via Facebook Engineering notes }