Is It Better To Be Lucky Or Smart? How Eventbrite Weathered the Amazon Outage

Last week was a bad week for many websites hosted on Amazon’s EC2 cloud computing platform, specifically users of its EBS and RDS services. At Eventbrite, we managed to escape a total outage; I’ll go into the reasons below. For a really well-reasoned read on some practical steps you can take to avoid future outages on a cloud-based service, please read Joe Stump’s excellent blog post here (ignoring his prejudice against SQL-based data stores).

The fun and games began early Thursday morning at 12:48 AM PDT with a flurry of alerts from our monitoring systems. Eventbrite’s systems engineering team groggily started inspecting the alerts. Eventbrite.com was still available, still serving traffic, still handling transactions, but the news was not good. We’d immediately lost two database slaves and our backup database master. Three production web servers were unavailable, and many of our testing environments were gone as well. Responses were a little slow, but Eventbrite.com was still working. Thanks to our use of Puppet, open-source software that lets us push identical configurations to multiple servers simultaneously, we had two new database slaves built and in production within a few minutes to take some of the load off the remaining slaves. Eventually, we rebuilt the three web servers that died.
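To make that pattern concrete, here’s a rough sketch of the idea (not our actual tooling: the AMI, instance type, zone, and Puppet master hostname are placeholders, and it uses the boto3 SDK purely for illustration). You launch stock instances whose user data bootstraps the Puppet agent, and Puppet converges them to the same database-slave configuration as their siblings:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# User data: bootstrap script that points the new node at the Puppet master,
# which then applies the same "db-slave" configuration used on the other slaves.
user_data = """#!/bin/bash
puppet agent --test --waitforcert 60 --server puppet.internal.example.com
"""

response = ec2.run_instances(
    ImageId="ami-12345678",      # placeholder base AMI
    InstanceType="m1.xlarge",    # placeholder instance size
    MinCount=2,
    MaxCount=2,
    Placement={"AvailabilityZone": "us-east-1d"},  # placeholder: a healthy zone
    UserData=user_data,
)

for instance in response["Instances"]:
    print(instance["InstanceId"], instance["Placement"]["AvailabilityZone"])
```

Because every slave is built from the same manifests, the only per-host work left is pointing replication at the master and letting it catch up.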

After we made sure no traffic was being sent to the unavailable server instances, we began trying to diagnose what had happened. Amazon had sent no notice yet. While part of the team looked through server logs, I was on the phone with Amazon support, who confirmed there was an issue in most of the East Coast availability zones where our server instances are located. One zone was unaffected and, because of how we’d planned our architecture, we had enough servers there to weather the unavailability of the impacted zones. So, we were both lucky and smart. We’d thought about spreading our instances across different zones, but not enough about redundancy in our testing and code-deploy environments. Our QA environment was basically unusable, and we were unable to easily deploy code for about 36 hours, until the EBS-backed instances that had been unavailable started coming back to life. We’ve since corrected these single points of failure so that we won’t be in that situation again.
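If you want to check whether your own fleet can take that kind of hit, a quick audit is easy to script. This is only a sketch (the role tag, the capacity threshold, and the use of boto3 are illustrative, not how we actually track it): count running web servers per availability zone and confirm that losing any single zone still leaves enough capacity.

```python
import collections
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Count running web servers per availability zone (the "role" tag is a placeholder).
per_zone = collections.Counter()
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[
        {"Name": "tag:role", "Values": ["web"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            per_zone[instance["Placement"]["AvailabilityZone"]] += 1

MIN_WEB_SERVERS = 3  # placeholder for the capacity needed to serve peak traffic

total = sum(per_zone.values())
for zone, count in per_zone.items():
    survivors = total - count
    status = "OK" if survivors >= MIN_WEB_SERVERS else "AT RISK"
    print(f"lose {zone}: {survivors} web servers left -> {status}")
```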

Despite the systems engineering team getting very little sleep that day, we learned a lot about the robustness of our production environment. We had speculated that we could serve all Eventbrite’s traffic on one master database server and one database slave, but, no matter how many tests are run, nothing proves the point like real production traffic.

For graph junkies (like me), here’s some performance pr0n from Alertsite showing response times and availability of the Eventbrite home page during the week of the Amazon EC2 EBS event (details of which are available here).

The event started at 12:47 AM on Thursday, April 21st. As you can see from the graph, our home page performance was largely unaffected. There were some spikes to over two seconds to load the page, but we see those at other times as well. This test runs every five minutes and checks that content we expect on a successful page load is actually present. We run it from New York, LA, and London. We have more granular monitoring, both internally and on Pingdom, but this graph is representative of those and easier on the eyes.
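The check itself is nothing exotic. A stripped-down version looks something like this (the expected string and interval are placeholders, and Alertsite and Pingdom obviously do far more, including running from multiple locations): fetch the home page, time it, and verify that the content you expect on a successful load is actually there.

```python
import time
import requests

URL = "http://www.eventbrite.com/"
EXPECTED = "Browse events"  # placeholder for the content the real check looks for
CHECK_INTERVAL = 300        # five minutes, matching the test described above
TIMEOUT = 30

while True:
    start = time.time()
    try:
        response = requests.get(URL, timeout=TIMEOUT)
        elapsed = time.time() - start
        # The page only counts as "up" if it loaded AND contains the expected content.
        ok = response.status_code == 200 and EXPECTED in response.text
    except requests.RequestException:
        elapsed = time.time() - start
        ok = False
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} ok={ok} response_time={elapsed:.2f}s")
    time.sleep(CHECK_INTERVAL)
```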

For those of you with finer vision, you can see a little red line appear several hours before the event. This was a change we made to make DNS lookups more frequent while we make some infrastructure changes to the site. It’s not related to the event. We made the change in preparation for moving our site load balancing to Amazon’s Elastic Load Balancing service. I’ll blog more about that later as we roll out a ton of site performance and reliability improvements.
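The “more frequent lookups” part just means a shorter TTL on the DNS record, so clients re-resolve sooner and pick up a new endpoint (such as an ELB CNAME) quickly after a cutover. Here’s a quick way to sanity-check a record’s TTL from the client side (a small sketch using dnspython; the record shown is just an example):

```python
import dns.resolver  # dnspython

# A short TTL means resolvers can only cache the answer briefly, so clients
# notice a change of endpoint soon after it happens.
answer = dns.resolver.resolve("www.eventbrite.com", "A")
print(f"TTL={answer.rrset.ttl}s")
for record in answer:
    print(record.address)
```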

In summary, Eventbrite’s infrastructure handled a major EC2 issue, but not without a few problems in our testing environments. We’ve fixed the single points of failure we found during the event. The major lesson learned: don’t forget about the parts of your infrastructure that matter, even if they aren’t critical to the basic operation of the site. Oh, and even if you think you’re smart enough, rabbit’s feet and four-leaf clovers are encouraged.