Maintenance Update for July 31st
2010/08/02
Last Saturday we made some changes to our own infrastructure. We had planned for 60 minutes of downtime at most. Unfortunately, due to a problem with network latency, the downtime stretched to more than two hours. More on the reasons below, but first a little recap of what we did.
Upgraded Infrastructure
We split up the infrastructure Scalarium runs on across more instances. We're managing a good number of instances by now and started to see where we needed to add more capacity or move processing to bigger EC2 instances. We're now running on a cluster of medium and large EC2 instances.
On the way, we migrated some of our components to newer versions, namely CouchDB and Redis.
- CouchDB has been upgraded to the recent 1.0 release. While we were at it, we also drastically improved disk access speed (which CouchDB relies on heavily) by moving the database to a RAID 0 of four EBS volumes. On the way we also set up another replica that constantly streams changes from the main database, all thanks to the continuous replication introduced in CouchDB 0.10. If you haven't checked it out yet, CouchDB's replication really is its killer feature. We easily streamed the data over from the old database server, while the new replica was already streaming the changes down from the new server as they came in.
The good news for you is that support for RAID across EBS volumes is halfway there for Scalarium. There's been extensive benchmarking around it, so be sure to have a look at the results, as they will help with the decision on which scheme to pick.
- Redis has been upgraded to the latest release candidate of the 2.0 release line, and has been given a lot more memory to work with. Redis has become an important part of our infrastructure, and it's only fair it gets a good amount of memory to work with. The 2.0 line also brings some interesting new features that we'll be moving some of our use cases to. Migrating data to the new Redis release over the network went just as smoothly as with CouchDB. We told the new server to connect to the old instance as a slave to stream down the full dataset. Then we unset the slave status, and the job was done.
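The Redis hand-over described above comes down to two SLAVEOF commands sent to the new server. Here is a minimal sketch of them, assuming the standard Redis multi-bulk request format; the function names and the host and port values are illustrative, not taken from our setup.

```python
import socket
from typing import List


def encode_command(*args: str) -> bytes:
    """Encode one command in Redis's multi-bulk request format."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def migration_commands(old_host: str, old_port: int) -> List[bytes]:
    """Commands to send to the NEW server, in this order."""
    return [
        # 1. Become a slave of the old instance and stream down the dataset.
        encode_command("SLAVEOF", old_host, str(old_port)),
        # 2. After the initial sync has finished, unset the slave status.
        encode_command("SLAVEOF", "NO", "ONE"),
    ]


def send(new_host: str, new_port: int, payload: bytes) -> bytes:
    """Send one encoded command to the new server and return the raw reply."""
    with socket.create_connection((new_host, new_port), timeout=5) as conn:
        conn.sendall(payload)
        return conn.recv(4096)
```

The second command should only go out once the new server has caught up, for example after checking the replication section of its INFO output. Sending it earlier would leave the new server with a partial dataset.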
Overall the move itself went smoothly and as planned, and we started pre-warming CouchDB's views on the new server and bringing the site back up.
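For the CouchDB side of the move, the two steps (continuous replication, then pre-warming views) can be sketched as plain HTTP requests. This is a rough illustration, not our actual tooling: the server URLs, database names and helper names are hypothetical. It relies on CouchDB's `_replicate` endpoint with `"continuous": true`, and on the fact that querying a view makes CouchDB bring its index up to date.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def replication_request(target_server: str, source_db: str, target_db: str) -> Request:
    """Build a POST to /_replicate that keeps pulling source_db into target_db."""
    body = {"source": source_db, "target": target_db, "continuous": True}
    return Request(
        url=f"{target_server}/_replicate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def view_warmup_url(server: str, db: str, design_doc: str, view: str) -> str:
    """URL that queries a view with limit=0: no rows come back, but the index gets built."""
    path = "/".join(
        [quote(db, safe=""), "_design", quote(design_doc, safe=""), "_view", quote(view, safe="")]
    )
    return f"{server}/{path}?limit=0"
```

Passing the request to `urllib.request.urlopen` on a machine that can reach both servers would start the replication. The warm-up URL can be fetched once per design document, since all views in a design document share one index.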
The Extended Downtime
When we started browsing through the site for some immediate tests, we realized that it was incredibly slow. We profiled the queries sent to CouchDB from our application servers and found that they had huge latency issues: they took about six times longer than the same queries run from other new instances set up in the maintenance window. We profiled other access times and found that connection setup was the bottleneck. Once a connection was established, things went at normal speed. These things happen on EC2 from time to time, and the recommendation is to go get a beer and wait for them to resolve themselves. You won't find that recommendation in the manuals, but it's not unheard of for latency issues to simply stop after some time.
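Separating connection time from request time, as described above, can be done with a few lines of socket code. The following is a minimal sketch under our own assumptions, not the profiling we actually ran; the function name and parameters are made up for illustration.

```python
import socket
import time
from typing import Tuple


def time_connect_and_request(
    host: str, port: int, payload: bytes, timeout: float = 5.0
) -> Tuple[float, float]:
    """Return (connect_seconds, exchange_seconds) for a single TCP exchange."""
    start = time.perf_counter()
    conn = socket.create_connection((host, port), timeout=timeout)
    connected = time.perf_counter()  # TCP handshake is done at this point
    try:
        conn.sendall(payload)
        conn.recv(4096)  # block until the first bytes of the reply arrive
    finally:
        conn.close()
    done = time.perf_counter()
    return connected - start, done - connected
```

Against CouchDB, the payload could be a bare HTTP request such as `b"GET / HTTP/1.0\r\n\r\n"`. Running the same measurement from several instances and comparing the first value shows whether connection setup, rather than the query itself, is what differs between them.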
Waiting for them to resolve themselves was not good enough for us, so we shut down the new instances and brought up fresh ones. On the way, we added more security groups to them to ensure that the whole cluster has at least one group in common. Sure enough, the fresh instances resolved the problem, and we saw the latency we expected.
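The "at least one group in common" condition is easy to check once the group memberships are at hand. A small sketch, assuming the memberships have already been collected into a dictionary; the instance and group names in the usage example are invented.

```python
from typing import Dict, Set


def common_groups(cluster: Dict[str, Set[str]]) -> Set[str]:
    """Return the security groups that every instance in the cluster belongs to."""
    group_sets = list(cluster.values())
    if not group_sets:
        return set()
    return set.intersection(*group_sets)


# Hypothetical cluster layout: instance name -> security groups
cluster = {
    "app-1": {"web", "cluster"},
    "db-1": {"db", "cluster"},
    "redis-1": {"cluster"},
}
shared = common_groups(cluster)
```

An empty result means that no single group spans the whole cluster.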
In the end, the downtime took a lot longer than expected, and for that we apologize. No instances were harmed during the downtime. The good news is that with the upgraded infrastructure we can put our focus back on new features. We already rolled out a good stash of new features in July that haven't been talked about yet here. We'll showcase some of them over the next week or so. Stay tuned!