The Amazon EU and Scalarium outage on August 7th
2011/08/19
With the aftermath of the AWS EU outage fading, we want to explain what happened to EC2 and Scalarium on Sunday, August 7th. First, make sure to read the Amazon EC2 explanation. Basically, one of the EU availability zones (or datacenters, in simpler terms) had a power failure and consequently all instances and EBS volumes in this datacenter went down. Inconvenient as it may be, this is something that can happen and something you should consider in your architecture.
For Scalarium in particular, the requirements for redundancy and availability in such cases are very high, so we had implemented ways to cope with a complete datacenter or region failure. Unfortunately, several issues led to Scalarium being unavailable for several hours after the initial power failure. We are sincerely sorry for that and want to explain why this happened and what we did to make sure Scalarium will handle such failure scenarios in the future without any problems.
The Setting
Before we dig into the Scalarium architecture we need to understand the EC2 regions and availability zones better. EC2 consists of five completely separate regions (independent EC2 installations). At the moment those regions are us-east-1 (Virginia), us-west-1 (California), eu-west-1 (Ireland), ap-southeast-1 (Singapore), and ap-northeast-1 (Tokyo). As each region is a separate EC2 installation, you cannot share or move any EC2 object like an elastic IP or EBS snapshot between regions. But this also means that any issue in the EU region will not propagate to the us-east-1 region.
Every region consists of multiple availability zones (AZ) or what could be called datacenters in an over-simplified world. The availability zones are named after their region with an added character, e.g. the region eu-west-1 has the following AZs: eu-west-1a, eu-west-1b and eu-west-1c. Availability zones have independent power supply and flood levels so that issues like a power failure should only affect one zone. Availability zones within one region share some common EC2 infrastructure like API, command nodes and routers. This allows you to migrate an elastic IP from one AZ to another or create an EBS volume in eu-west-1a out of a snapshot in eu-west-1c.

Traffic within one AZ is fast and free while crossing AZs or even regions adds latency and you have to pay for the traffic.
When you start instances with EC2 or Scalarium, you need to choose the AZ in which the instance should live. Once an instance is running in an AZ it is of course dependent on the availability and connectivity of that AZ in order to serve its purpose. EC2 has no built-in redundancy magic that will move an instance from one datacenter to another if an AZ experiences problems. Merely having the possibility to use multiple AZs doesn't add any redundancy to your application. You have to actively use it by starting multiple instances in multiple datacenters and sharing your data across them.
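To make the point concrete, here is a minimal Python sketch (not part of Scalarium; the role names and the `roles_lost` helper are made up for illustration) that checks which roles of an application would disappear if a single AZ went down:

```python
def roles_lost(placement, failed_zone):
    """Return the roles that have no instance left once failed_zone is down."""
    all_roles = {role for role, _ in placement}
    surviving = {role for role, zone in placement if zone != failed_zone}
    return sorted(all_roles - surviving)


# Hypothetical layouts: everything in one AZ vs. every role spread over two AZs.
single_zone = [("app", "eu-west-1a"), ("db", "eu-west-1a")]
multi_zone = [
    ("app", "eu-west-1a"), ("app", "eu-west-1b"),
    ("db", "eu-west-1a"), ("db", "eu-west-1c"),
]

print(roles_lost(single_zone, "eu-west-1a"))  # ['app', 'db']
print(roles_lost(multi_zone, "eu-west-1a"))   # []
```

The check only covers instance placement; whether the surviving database copy is actually up to date is a separate question of replication.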
What is very important to know about the AZ naming is that the a, b, c, etc. names are just logical names that are arbitrarily assigned per AWS account. That means that for two different AWS accounts the logical name eu-west-1a could point to a different physical datacenter. Or it could be the same. There is no official way of finding out. Amazon's reasoning behind this was that if the AZ names were the same for every account, most people would end up using the 'a' zone as it is the first in the list. In order to prevent e.g. us-east-1a from hosting 90% of the us-east-1 region's customers, the names are shuffled for different accounts. Within the same AWS account the naming is of course stable and ap-northeast-1a is always the same ap-northeast-1a.
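The effect of per-account naming can be pictured with a toy model. Amazon does not publish how the names are assigned, so the seeded shuffle, the `dc-*` datacenter names and the account IDs below are purely assumptions for illustration:

```python
import random

# Hypothetical physical datacenters behind one region.
PHYSICAL_ZONES = ["dc-1", "dc-2", "dc-3"]


def logical_zone_map(account_id, region="eu-west-1"):
    """Return a stable, account-specific mapping of logical AZ names to datacenters."""
    shuffled = PHYSICAL_ZONES[:]
    # Seeding with the account ID keeps the mapping stable for that account.
    random.Random(account_id).shuffle(shuffled)
    return {f"{region}{letter}": dc for letter, dc in zip("abc", shuffled)}


main_account = logical_zone_map("account-main")      # hypothetical account ID
kernel_account = logical_zone_map("account-kernel")  # hypothetical account ID

print(main_account)
print(kernel_account)
```

Comparing the two dictionaries shows whether different logical names happen to point at the same datacenter for both accounts. With real AWS accounts there is no official way to make that comparison.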
Scalarium Architecture
Scalarium consists of several components. The most important ones are the frontend application servers serving https://manage.scalarium.com, the backend workers doing all the actual work like starting and stopping instances, the database masters (CouchDB and Redis) and the agent bus.
In order to provide a certain redundancy Scalarium is installed in several datacenters in multiple regions. We have one main installation in a region where the active servers and the database master are. In our case we used to run the main installation in the eu-west-1 region, let's say eu-west-1a. With the main database servers in eu-west-1a, we have replications/slaves/copies in eu-west-1b, eu-west-1c in the same region and further in other regions like us-east-1a and ap-southeast-1a.

In the case of a failure of the AZ where our main installation currently is or even of the complete region, we can switch to one of the backup locations as all our data is replicated there. We just need to start more workers and app servers in that AZ. In case of a complete region failure we further need to update some DNS records.
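The decision logic described above could be sketched roughly as follows. This is a simplified illustration rather than the real Scalarium tooling; the function names and the step descriptions are hypothetical:

```python
def region_of(zone):
    """Strip the trailing AZ letter: 'eu-west-1b' -> 'eu-west-1'."""
    return zone[:-1]


def plan_failover(main_zone, replica_zones, failed_zones=(), failed_regions=()):
    """Pick a healthy replica location and list the steps needed to switch to it."""
    healthy = [
        zone for zone in replica_zones
        if zone not in failed_zones and region_of(zone) not in failed_regions
    ]
    if not healthy:
        raise RuntimeError("no healthy replica location left")
    # Prefer a replica in the main region: no DNS change is needed there.
    # sort() is stable, so the original order is kept within each group.
    healthy.sort(key=lambda zone: region_of(zone) != region_of(main_zone))
    target = healthy[0]
    steps = ["start workers and app servers in " + target]
    if region_of(target) != region_of(main_zone):
        steps.append("update DNS records")
    return target, steps


replicas = ["eu-west-1b", "eu-west-1c", "us-east-1a", "ap-southeast-1a"]
print(plan_failover("eu-west-1a", replicas, failed_zones=["eu-west-1a"]))
print(plan_failover("eu-west-1a", replicas, failed_regions=["eu-west-1"]))
```

The first call stays inside eu-west-1 and only needs new servers; the second leaves the region and therefore also includes the DNS update.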
We can spawn more servers and even a complete backup installation with a "kernel" Scalarium. This separate and internal Scalarium installation controls the public Scalarium system and allows us to migrate from one AZ to another (cross-region).
So we thought we were prepared for a power outage resulting in a zone or even region failure.
Timeline
On Sunday around 7:40 PM CEST / 10:40 AM PDT we got the first alerts via SMS that some parts of Scalarium were no longer reachable. We immediately logged into some machines via SSH. We didn't notice anything unusual but investigated for several minutes. Then we abruptly lost connectivity to all our machines in our main datacenter eu-west-1a and could no longer consistently reach the API in eu-west-1. The AWS status page didn't show any events or warnings and the EC2 forums were quiet. The AWS console either returned blank reports or showed that everything was OK, while we still couldn't reach any instance in eu-west-1a. We tweeted that we were seeing some problems on EC2 and started to prepare our options.
At 8:11 PM CEST / 11:11 AM PDT Amazon finally acknowledged that there was a problem by putting "We are investigating connectivity issues in the EU-WEST-1 region" on the AWS status page. As we were used to EC2 having temporary network issues from time to time (like any hoster at that scale), we decided not to switch to one of our backup systems immediately.
The reasoning behind this was that we were seeing connectivity problems with the complete API in eu-west-1, not only in one AZ (Amazon confirmed the region-wide API problems at 8:27 PM CEST / 11:27 AM PDT). So switching to another AZ within the EU region didn't seem safe until we had more information from Amazon. Migrating to one of our US regions meant booting up more servers there and updating DNS. As updating DNS takes some time and network issues are usually fixed within minutes, we chose to wait a bit longer with the transition.
As it became apparent that the problem was not fixed within minutes, we started to prepare the migration to our us-west-1 backup system. This is where the first unexpected issue hit us. We prepared and planned for AZ outages by putting the "kernel" Scalarium into a different AZ from the main installation. But for security and accounting reasons all Scalarium installations use different AWS accounts. Unfortunately the random logical naming of the AZs meant in our case that even though the main installation ran in eu-west-1a and the "kernel" installation in eu-west-1b, they were in fact running in the very same datacenter and thus were both down.
This meant that we could not immediately spawn more servers in a replacement AZ. It took us half an hour to re-build the "kernel" Scalarium. This could have been prevented by placing the "kernel" Scalarium into a completely different region rather than into a supposedly different AZ.
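The lesson can be expressed as a simple rule: two placements are only guaranteed to be in different datacenters if they are in different regions, or if they use different AZ names within the same AWS account. A small Python sketch of that rule, with hypothetical account names:

```python
from typing import NamedTuple


class Placement(NamedTuple):
    account: str  # AWS account the installation runs under (hypothetical names)
    zone: str     # logical AZ name as seen by that account


def guaranteed_separate(a, b):
    """True only if the two placements cannot share a physical datacenter."""
    if a.zone[:-1] != b.zone[:-1]:
        # Different regions are always separate EC2 installations.
        return True
    # Within one region, AZ names are only comparable inside the same account.
    return a.account == b.account and a.zone != b.zone


main = Placement("main-account", "eu-west-1a")
kernel_before = Placement("kernel-account", "eu-west-1b")
kernel_after = Placement("kernel-account", "us-east-1a")

print(guaranteed_separate(main, kernel_before))  # False
print(guaranteed_separate(main, kernel_after))   # True
```

A `False` result does not mean the two installations share a datacenter, only that nothing rules it out. The us-east-1a placement is just an example of "a different region".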
Around 8:50 PM CEST / 11:50 AM PDT we started the migration to the us-west-1 region. We already had all our data there in the form of CouchDB and Redis slaves. With the help of the "kernel" Scalarium the remaining infrastructure was ready within minutes. Usually we would then copy the data from the slaves to new, bigger and faster master machines. Unfortunately all the slave infrastructure was running as small and medium 32 bit instances, which was our mistake and the second issue we hit: the new masters were 64 bit machines, so we could not use the CouchDB view files from the slaves and had to re-compute them on the master nodes. We had tested and run this recovery and migration process several times on the staging systems but never hit this issue, as all systems there were 32 bit.
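A pre-flight check along these lines could have flagged the mismatch before the failover. It is only a sketch: the host names are invented, and how the word size of each machine is collected is left open (on the machine itself, `struct.calcsize("P") * 8` reports the pointer size of the running Python interpreter):

```python
import struct


def local_word_size():
    """Pointer size in bits of the running Python interpreter (32 or 64)."""
    return struct.calcsize("P") * 8


def mismatched_replicas(master_bits, replica_bits_by_host):
    """Return the replica hosts whose word size differs from the planned masters."""
    return sorted(
        host for host, bits in replica_bits_by_host.items() if bits != master_bits
    )


# Hypothetical inventory: planned 64 bit masters, replicas of mixed word size.
replicas = {"couchdb-slave-us": 32, "redis-slave-us": 32, "couchdb-slave-ap": 64}

print(local_word_size())
print(mismatched_replicas(64, replicas))  # ['couchdb-slave-us', 'redis-slave-us']
```

Running the same comparison against a staging environment only helps if staging mirrors the word sizes used in production.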
The result was that we had to wait for CouchDB to re-create the views which took some time. CouchDB views are somewhat similar to MySQL indices, so the process of refreshing the views is similar to adding an index on a big table. And we had to do this for multiple views. It took the Quadruple HighMEM instances several hours to complete this. Around 5:30 AM CEST / 8:30 PM PDT Scalarium was working normally again.
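For reference, the times quoted in this timeline can be checked with a few lines of Python. The sketch assumes the restoration at 5:30 AM CEST fell on Monday, August 8th, and uses fixed UTC offsets (CEST = UTC+2, PDT = UTC-7):

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are enough here; no daylight saving change falls in this window.
CEST = timezone(timedelta(hours=2), "CEST")
PDT = timezone(timedelta(hours=-7), "PDT")

first_alert = datetime(2011, 8, 7, 19, 40, tzinfo=CEST)
restored = datetime(2011, 8, 8, 5, 30, tzinfo=CEST)  # assumed date: Monday, August 8th

print(first_alert.astimezone(PDT).strftime("%H:%M %Z"))  # 10:40 PDT
print(restored.astimezone(PDT).strftime("%H:%M %Z"))     # 20:30 PDT
print(restored - first_alert)                            # 9:50:00
```

That puts the time from the first alerts to full recovery at just under ten hours.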
Summary of Issues and Effects
Even though we expected AZ or region failures we hit two issues that prevented us from being available within minutes.
The first was that the "kernel" Scalarium was running in the same AZ as the main installation due to the random logical naming of AZs between AWS accounts.
The second problem was that the backup servers were running in 32 bit and thus their pre-computed view data could not be used by the new masters.
This resulted in the Scalarium management UI being unavailable until 5:30 AM CEST / 8:30 PM PDT. Until then Scalarium could not be used to manage our client machines. This did not affect any running instances and did not result in service disturbances. But it meant that new instances could not be started via Scalarium. As the EC2 API in eu-west-1 returned errors for some time and EBS based instances in the affected AZ had issues until Monday or in some cases even Tuesday, there was not much Scalarium could have done to help with instances in eu-west-1. People wanting to start replacement instances in another zone or region had to wait until Scalarium was restored.
Customers prepared for AZ failure, e.g. by having multiple DB instances or slaves in other AZs or regions, could recover easily, even with Scalarium affected. Customers running only in the affected AZ needed to wait for Amazon to restore their EBS data to continue operations.
What We Did To Prevent This From Happening Again
We fixed the two issues that prevented a seamless migration by putting the "kernel" Scalarium into a different AWS region and running all backup instances in 64 bit. In case of a region failure we can now switch to a replacement region quickly. All data was and is replicated to at least two additional regions and additional AZs within the same region.
We apologize for any issues our unavailability caused.
During the event we reached out to the most affected customers and helped them through the outage. If you had any problems during the event you should definitely consider an architecture that factors in AZ failure and runs in multiple AZs or even different regions. With Scalarium it is very easy to start additional capacity in a different AZ/region on demand and migrate your site. But this depends on you still having access to your data. A simple LAMP stack with one MySQL server will not survive an AZ outage unless you have e.g. slaves running in other AZs or replicate or back up your data regularly.
If you need any help in building a reliable architecture, don’t hesitate to contact us.