A note about the site being down

Apr 22, 2011

Yesterday, EveryBlock, along with a number of other Web sites, was
knocked offline when our hosting provider, Amazon Web Services
(AWS), began having service problems in one of its regions. (A
quick aside: a “region” is a term of art for AWS, meaning, roughly,
a geographically related group of data centers.) We, the EveryBlock
team, just brought the site back up in the past hour, and it is now
operating normally.

While the acute problem originated with AWS, EveryBlock is not without
blame for this downtime. Frankly, we screwed up. AWS explicitly advises
that developers should design a site’s architecture so that it is
resilient to occasional failures and outages such as what occurred
yesterday, and we did not follow that advice. As the team member who
is de facto in charge of this area of our project, I apologize to
our users who rely on EveryBlock to connect with their neighbors and
get their local news.

What happened?

We put all our eggs in one basket, and that basket got knocked
over. All of our servers and related resources were running in the same
availability zone, or AZ. (Bear with me if you are not a technically
inclined person; I’ll spare you the industry jargon as much as I can.)
Within a region at AWS, an AZ is akin to a single data center,
a single location containing physical servers and networking gear. The
service disruption to AWS yesterday (which is still ongoing as I
write this) impacted all of the AZs in this region. However, had
we deployed our various servers across multiple AZs and taken into
account the fact that individual servers and other services that AWS
provides can and do go down from time to time, we would likely have
remained available during this disruption. In fact, while many sites
were down for the same reason as EveryBlock, other AWS customers who
designed for these contingencies remained up during the trouble.

To get back online, we began moving data from the affected AZ to a
different one. Fortunately, we still had access to the servers, even
though they were running slowly, to the point of being unusable if you
were trying to access the site. This morning we finished migrating
over to a new AZ, and after tracking down a few bugs and lingering
dependencies on that old AZ, we brought the site back online, fast
as ever.

What are we doing to avoid this in the future?

As soon as the AWS outage began, the EveryBlock team recognized that
we had work to do to guard against this kind of thing in the future. We
also recognized that this is a pretty well-worn path, and that,
in addition to AWS’s guidelines, there are plenty of examples of a
well-architected, highly-available site running on services like AWS
from which we can draw experience.

Specifically, we will be setting up servers in multiple AZs, and
designing how they interconnect in such a way that, if one or more
servers go down, or even an entire AZ, the other servers will be
able to pick up the slack and continue serving EveryBlock to users.
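
For the technically inclined, here is a rough sketch of that idea. It
is only an illustration, not our actual setup: the zone names, server
names, and both helper functions are made up for this example, and it
makes no calls to AWS. It spreads servers across zones round-robin,
then works out which servers could still take traffic if a zone fails.

```python
from itertools import cycle

# Hypothetical zone names, for illustration only.
ZONES = ["zone-a", "zone-b", "zone-c"]


def place_servers(server_names, zones):
    """Assign servers to zones round-robin so no single zone holds them all."""
    placement = {}
    zone_cycle = cycle(zones)
    for name in server_names:
        placement[name] = next(zone_cycle)
    return placement


def healthy_servers(placement, failed_zones):
    """Return the servers that can still take traffic when some zones are down."""
    return sorted(
        name for name, zone in placement.items() if zone not in failed_zones
    )


# Hypothetical server names; four servers spread over three zones.
placement = place_servers(["web1", "web2", "web3", "web4"], ZONES)

# Simulate losing one whole zone and see who is left to pick up the slack.
survivors = healthy_servers(placement, failed_zones={"zone-a"})
```

With everything in one zone, the list of survivors would be empty;
spread out like this, losing a zone still leaves servers standing.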

We’re also challenging our assumptions about the various bits of
our site and how they work together. Oftentimes during development
of a site like EveryBlock, you make choices that in the moment seem
expedient, but actually introduce too strong a dependency between
different parts of the site, making it hard to stay up when one of
them goes down. Software people refer to this as tight coupling,
and it’s better, in a distributed server environment like AWS as in
software, to be loosely coupled. That requires us to revisit some
design decisions and will take a little time to roll out.
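
As a small illustration of loose coupling (a sketch under assumed
names, not code from our site), the hypothetical render_page function
below builds a page from two independent parts. If one part cannot be
fetched, the page shows a placeholder for it instead of failing
outright. The fetch functions stand in for whatever services a real
page would depend on.

```python
def render_page(fetch_news, fetch_comments):
    """Build a page from two independent parts.

    A failure in one part is contained, so the other part still renders.
    """
    sections = {}
    for name, fetch in (("news", fetch_news), ("comments", fetch_comments)):
        try:
            sections[name] = fetch()
        except Exception:
            # Degrade gracefully: show a placeholder instead of
            # letting one failed dependency take down the whole page.
            sections[name] = "temporarily unavailable"
    return sections


def working_news():
    """Stand-in for a dependency that is up (hypothetical data)."""
    return ["Street closure on Main St."]


def broken_comments():
    """Stand-in for a dependency that is down."""
    raise ConnectionError("comment store unreachable")


page = render_page(working_news, broken_comments)
```

In a tightly coupled design, the failed comment fetch would have
raised all the way up and the visitor would have seen an error page.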

In the meantime, we feel that EveryBlock is already in better shape
than yesterday, in terms of being able to withstand this kind of
outage. However, given that AWS is continuing to restore service,
we can’t say for certain that we won’t go down again. But the changes
we made in the last day to get the site back up also had the benefit
of putting into place some of the new site architecture that we will
need to be more available regardless of this particular incident.

Lastly, we’d like to give an unequivocal endorsement of AWS. It is
a terrific service, and we love being able to set up new servers as
needed, and with great flexibility. Web site hosting is an art, and
sites and hosting providers do go down from time to time. Overall,
AWS has provided EveryBlock consistent, reliable service. And we want
to give a shout-out to the engineers working furiously and without
a lot of sleep to fix things up in the affected region. It’s not a
fun situation to be in, and they are doing their best in a tough,
pressure-filled environment.

