Yesterday, EveryBlock, along with a number of other Web sites, was knocked off-line when our hosting provider, Amazon Web Services (AWS), began having service problems in one of its regions. (A quick aside: a “region” is a term of art for AWS, meaning, roughly, a geographically related group of data centers.) We, the EveryBlock team, brought the site back up within the past hour, and it is now operating normally.
While the acute problem originated with AWS, EveryBlock is not without blame for this downtime. Frankly, we screwed up. AWS explicitly advises developers to design a site’s architecture so that it is resilient to occasional failures and outages such as the one that occurred yesterday, and we did not follow that advice. As the team member who is de facto in charge of this area of our project, I apologize to our users who rely on EveryBlock to connect with their neighbors and get their local news.
We put all our eggs in one basket, and that basket got knocked over. All of our servers and related resources were running in the same availability zone (AZ). Bear with me if you are not a technically inclined person; I’ll spare you the industry jargon as much as I can. Within a region at AWS, an AZ is akin to a single data center, a single location containing physical servers and networking gear. The service disruption to AWS yesterday (which is still ongoing as I write this) impacted all of the AZs in this region. However, had we deployed our various servers across multiple AZs and taken into account the fact that individual servers and other services that AWS provides can and do go down from time to time, we would likely have remained available during this disruption. In fact, while many sites were down for the same reason as EveryBlock, other AWS customers who designed for these contingencies remained up during the trouble.
To get back online, we began moving data from the affected AZ to a different one. Fortunately, we still had access to the servers, even though they were running slowly, to the point of being unusable if you were trying to access the site. This morning we finished migrating over to a new AZ, and after tracking down a few bugs and lingering dependencies on the old AZ, we brought the site back online, fast as ever.
What are we doing to avoid this in the future?
As soon as the AWS outage began, the EveryBlock team recognized that we had work to do to guard against this kind of thing in the future. We also recognized that this is a pretty well-worn path, and that, in addition to AWS’s guidelines, there are plenty of examples of well-architected, highly available sites running on services like AWS from which we can draw experience.
Specifically, we will be setting up servers in multiple AZs, and designing how they interconnect in such a way that, if one or more servers go down, or even an entire AZ, the other servers will be able to pick up the slack and continue serving EveryBlock to users.
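For the technically inclined, here is a minimal sketch of that idea in Python. It is only an illustration, not our actual setup: the server names, the zone names, and the `route` helper are all hypothetical, and a real deployment would rely on a load balancer and health checks rather than a hand-written list.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Server:
    name: str            # hypothetical server name
    zone: str            # hypothetical availability zone label
    healthy: bool = True


def healthy_servers(servers, down_zones=frozenset()):
    """Return servers that are healthy and not in a zone marked as down."""
    return [s for s in servers if s.healthy and s.zone not in down_zones]


def route(servers, request_count, down_zones=frozenset()):
    """Spread requests round-robin over whatever servers are still usable."""
    pool = healthy_servers(servers, down_zones)
    if not pool:
        raise RuntimeError("no healthy servers in any zone")
    picker = cycle(pool)
    return [next(picker).name for _ in range(request_count)]


# Two servers in each of two zones (all names are made up for illustration).
SERVERS = [
    Server("web-1", "zone-a"),
    Server("web-2", "zone-a"),
    Server("web-3", "zone-b"),
    Server("web-4", "zone-b"),
]

if __name__ == "__main__":
    # Normal operation: all four servers share the load.
    print(route(SERVERS, 4))
    # An entire zone is lost: the remaining zone picks up the slack.
    print(route(SERVERS, 3, down_zones={"zone-a"}))
```

The point of the sketch is that losing every server in one zone still leaves a usable pool, which is exactly what a single-AZ deployment cannot offer.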
We’re also challenging our assumptions about the various bits of our site and how they work together. Oftentimes during development of a site like EveryBlock, you make choices that seem expedient in the moment but actually introduce too strong a dependency between different parts of the site, making it hard to stay up when one of them goes down. Software people refer to this as tight coupling, and it’s better, in a distributed server environment like AWS as in software, to be loosely coupled. That requires us to revisit some design decisions and will take a little time to roll out.
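To make the distinction concrete, here is a small, hedged Python sketch of loose coupling. The function names and the idea of separate "news" and "comments" services are assumptions made up for illustration, not a description of how EveryBlock is built. The page is assembled so that a failure in one part degrades that part only, instead of taking the whole page down.

```python
from typing import Callable, Dict, List


def fetch_or_default(fetch: Callable[[], List[str]], default: List[str]) -> List[str]:
    """Call a dependency; if it fails for any reason, fall back to a default."""
    try:
        return fetch()
    except Exception:
        # Deliberately broad for this sketch: any failure in the dependency
        # should degrade one section of the page, not the entire page.
        return default


def build_page(fetch_news: Callable[[], List[str]],
               fetch_comments: Callable[[], List[str]]) -> Dict[str, List[str]]:
    """Assemble a page from independent parts, each allowed to fail alone."""
    return {
        "news": fetch_or_default(fetch_news, []),
        "comments": fetch_or_default(fetch_comments, []),
    }


def working_news() -> List[str]:
    # Hypothetical stand-in for a service that is up.
    return ["Street closure on Main St."]


def broken_comments() -> List[str]:
    # Hypothetical stand-in for a service that is down.
    raise ConnectionError("comments service unreachable")


if __name__ == "__main__":
    # The news still renders even though the comments service is down.
    print(build_page(working_news, broken_comments))
```

In a tightly coupled version, `build_page` would call both services directly and the `ConnectionError` would propagate, so one failing part would bring down the whole page.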
In the meantime, we feel that EveryBlock is already in better shape than yesterday, in terms of being able to withstand this kind of outage. However, given that AWS is still restoring service, we can’t say for certain that we won’t go down again. But the changes we made in the last day to get the site back up also had the benefit of putting into place some of the new site architecture that we will need to be more available, regardless of this particular incident.
Lastly, we’d like to give an unequivocal endorsement of AWS. It is a terrific service, and we love being able to set up new servers as needed, and with great flexibility. Web site hosting is an art, and sites and hosting providers do go down from time to time. Overall, AWS has provided EveryBlock consistent, reliable service. And we want to give a shout-out to the engineers working furiously and without a lot of sleep to fix things up in the affected region. It’s not a fun situation to be in, and they are doing their best in a tough, pressure-filled environment.