node.js version selection

2011/11/21

We deployed support for node.js apps some months ago. When you wanted to select a node.js version different from the default, you had to do so with, for example, custom JSON.

Until today.

selecting a node.js version

You can now select the node.js version via the node.js settings, stored with your node.js app server role. Of course, you can still set the version by using custom JSON.

Improved instance selection when repeating deployments

2011/11/18

When you deploy an application with Scalarium, you can pick the instances to deploy to. Unless you explicitly deselect some instances, the deployment hits every running instance.

We changed that behavior slightly for repeated deployments.

Say you have a deployment in a cluster with two running instances. During that deployment you decided to skip one instance. When you repeat that deployment, the skipped instance stays deselected and all other running instances are selected for deployment.

deselected instances in repeated deployments

For 24/7 instances, this means that every instance that was selected before is selected again. The selection also includes load-based and time-based instances that were not running during the original deployment but are running when you repeat it.
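
The selection rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not Scalarium's actual code; the function name and the instance names are made up for the example.

```python
def instances_for_repeat(running_now, skipped_before):
    """Return the instances preselected when a deployment is repeated.

    running_now: names of the instances running at repeat time.
    skipped_before: names explicitly deselected in the original deployment.
    """
    skipped = set(skipped_before)
    # Keep the original order and drop only what was explicitly skipped.
    return [name for name in running_now if name not in skipped]


# Original deployment: app1 and app2 were running, app2 was skipped.
# At repeat time a load-based instance (app3) is running as well.
selected = instances_for_repeat(["app1", "app2", "app3"], ["app2"])
print(selected)  # ['app1', 'app3']
```

Note that app3 is selected even though it was not part of the original deployment, because it was never explicitly skipped.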

Introducing the Scalarium client

2011/11/17

When working on an instance over SSH, there are a lot of jobs that need to be done again, and again, and again. Since Scalarium is a product that focuses on automation, we thought it'd be a good idea to create a handy tool that helps us with these jobs. We call it the Scalarium client.

From now on, the client is part of the Scalarium agent, which runs on every instance. The client is symlinked in /usr/sbin, so you can use it from anywhere.

Here's what it looks like.

root@instance:/# scalarium-client
usage: scalarium-client command [command options]

Commands:
    help   - Shows list of commands or help for one command
    list   - List all chef JSON files and their activities (setup, configure, ...)
    log    - View the latest or a given chef log file
    run    - Do some real work, call help run for details
    status - Display useful information

The client makes things easy that were possible before, but tedious. It tells you what happened on your instance, as an overview or in detail. It tells you the status of the Scalarium agent.

The client also lets you do things that weren't possible before. Imagine you want to run a single one of your custom recipes. You also want it to look and feel like you triggered it from the Scalarium UI, but you don't want to actually go to the Scalarium UI.

The client does that for you. Your recipe is run, the custom JSON you defined in the Scalarium UI is included, and the Chef run shows up in the instance's logs in the Scalarium UI.

Give it a try, and have a look at our knowledge base for details and examples.

New Chef Event Type: Shutdown

2011/11/02

Scalarium is all about automation and customization. The main system behind the automation for the instance configuration in Scalarium is the life cycle system.

When an instance boots and connects back to Scalarium, it triggers the setup event and thus executes the bootstrapping.

Once an instance successfully finishes its setup, the configure event is triggered on all instances in the same cloud. This way they can update their configuration if necessary. For example, a load balancer instance will add the newly booted application server, and the database server will add an ACL entry that allows this application server to connect to the database.

When an instance is stopped, the configure event is triggered again. This way your applications and services always know about the current state of your cloud and can respond to changes if necessary.

Scalarium also offers the deploy and undeploy events, which are triggered when you deploy or delete an application.

Today we introduce a new event type: shutdown.

The shutdown event is triggered when you stop an instance. Previously Scalarium would immediately stop the instance on EC2 and trigger a configure on the whole cloud.

From now on Scalarium first sends the shutdown event to the instance, waits 45 seconds, and only then really stops the instance. During those 45 seconds you can run any cleanup recipes that, for example, shut down services or deregister your instance from other services.
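
To make the new order of events concrete, here is a small Python sketch of the stop sequence. It is a model for illustration only: the function name and the instance names are hypothetical, and the assumption that the closing configure event goes to the remaining instances of the cloud is ours, derived from the description of the configure event above.

```python
SHUTDOWN_GRACE_SECONDS = 45  # waiting period stated in this post


def stop_sequence(instance, cloud_instances):
    """Return the ordered steps taken when `instance` is stopped."""
    others = [i for i in cloud_instances if i != instance]
    return [
        ("shutdown", [instance]),          # cleanup recipes run on the instance
        ("wait", SHUTDOWN_GRACE_SECONDS),  # grace period before the real stop
        ("stop", [instance]),              # the instance is stopped on EC2
        ("configure", others),             # assumption: the rest of the cloud reconfigures
    ]


for step, detail in stop_sequence("app2", ["lb1", "app1", "app2"]):
    print(step, detail)
```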

The built-in recipes make use of this to stop Apache or MySQL on the instances. This way instances are removed from the load balancer in a cleaner way.

As with all other event types, you can of course add your own Chef recipes for your role that will be run on shutdown.


If you want to know more about the shutdown event or custom instance setup in general, make sure to check out the documentation.

Awesome customers are awesome

2011/10/21

Today I want to introduce you to two of our customers. Both had a good week last week, and we thought we should share that with the rest of you.

Infopark


Thomas Witt sent us this nice mail:

Dear everyone,

we just won the Computerwoche #1 award from Computerwoche's Best In Cloud award (category: platform as a service). Without AWS and Scalarium this wouldn't have been possible! Thank you guys!

Best, T.


Infopark offers a CMS called Fiona, which is widely used, especially in German-speaking countries.

The Best in Cloud award was handed out by Computerwoche (“computer week”), a major German tech publication. Infopark applied with a project called “Airport Nürnberg auf Wolke 7 - Webauftritt, CMS und WebCRM als Plattform aus der Cloud”.

Infopark hosts the SaaS version of the Fiona CMS on Scalarium and automatically creates a tenant version for new customers with our API.

Check out the infopark cloud express CRM at infopark.com.

Crashlytics


Scalarium customer Crashlytics had a great last week too. Jeff Seibert and Wayne Chang raised $1 million in a seed funding round.

Crashlytics addresses the needs of app makers to better understand what sort of bugs their mobile applications are experiencing. The lightweight Crashlytics SDK (~ 75 kB) works alongside other SDKs without any problem.

They care deeply about building great tools for developers and have built a product for iOS crash reporting, with Android crash reporting coming soon.

If you are developing mobile apps, check it out and sign up at crashlytics.com.

Crashlytics processes the crash reports and exceptions and hosts its public Web site on Scalarium. In this way they can ramp up staging systems and scale with the number of incoming crash reports in no time.

You have an awesome story you want to share? Feel free to shoot me an email at thomas.metschke@scalarium.com

Regards, Thomas

Permissions and Access Control

2011/09/22

As for access rights, until now there was only one possible difference between two users. A user was an admin, or he wasn't. As an admin, you had access to some features that you didn't have as a normal user, like managing other users. Every newly created user had access to all existing clouds, and every existing user had access to a new cloud. We wanted to let users manage access rights in their accounts per user and per cloud.

The distinction between admins and normal users still exists. You still need to be an admin in order to manage, for example, users, credentials or SSH keys, but access to clouds is handled with the new permission system.

Whenever you try to access a cloud or something that is a "child" of a cloud, your permissions are checked. In this respect applications, instances, roles and deployments are all children of clouds. This means that when you change a user's permissions for a cloud, you also change his permissions for any application, instance, role and deployment that belongs to that cloud.

Here's what the UI for managing permissions looks like.

Permissions can be edited per user per cloud

There are a few things going on in this interface. Let's have a look at the details.

The tabs let you choose between "Clouds" and "Users". Inside these tabs you are able to change default permissions and concrete permissions.

When you are in the "Clouds" tab, the default permissions are the ones that existing users have to a new cloud. In the dropdown below you can select a cloud in order to see and edit every user's permissions to that cloud.

When you are in the "Users" tab, the default permissions are the ones that a new user has to existing clouds. In the dropdown below you can select a user in order to see and edit that user's permissions to every cloud.

For every combination of a user and a cloud we store a permission record. A permission record consists of a level and two flags. The level is something new and it can be one of "manage", "deploy", "show" and "none". The flags are not new, we just changed their scope.

In the past you could already decide whether SSH user handling was enabled for a user or not. On top of that, you could decide whether that user was also added to the sudoers file or not. Instead of making this decision per user, you can now make it per user and cloud. That's why we removed the corresponding checkboxes from the user form.

No more SSH flags when editing a user

But let's get back to that permission level. As stated above, it can be one of "manage", "deploy", "show" and "none".

The level "none"
prevents you from noticing a cloud. You won't see the cloud or any of its children. And of course you can't trigger any actions or edit things.

The level "show"
lets you see a cloud and all its children, but you are not allowed to take any actions.

The level "deploy"
lets you deploy all applications belonging to this cloud as well as trigger cloud deployments such as "update cookbooks". You can't edit the cloud or any child.

The level "manage"
lets you do anything you were able to do before we introduced the permissions feature. So you can add, start/stop and remove servers, add, edit and remove roles and applications and so on.
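
A permission record and the four levels can be modeled roughly like this. The sketch is in Python and is not our actual data model; the class and field names are made up, and it assumes that the levels are strictly ordered, so that each level includes everything the lower ones allow.

```python
from dataclasses import dataclass

# Levels from least to most privileged (assumed to be strictly ordered).
LEVELS = ("none", "show", "deploy", "manage")


@dataclass(frozen=True)
class Permission:
    """One record per combination of user and cloud."""
    level: str = "none"
    ssh: bool = False   # SSH user handling enabled for this cloud
    sudo: bool = False  # user is added to the sudoers file


def allows(permission, required_level):
    """True if the record grants at least `required_level`."""
    return LEVELS.index(permission.level) >= LEVELS.index(required_level)


record = Permission(level="deploy", ssh=True)
print(allows(record, "show"))    # True
print(allows(record, "manage"))  # False
```

Because applications, instances, roles and deployments are children of a cloud, a check like this only needs the record of the cloud they belong to.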

As mentioned earlier, the distinction between admin and normal user still exists. Admins can change permissions, including their own. That's why we decided not to allow restricting an admin's permissions at all.

Admins always have full access

Our API is also aware of permissions. When you try to access a cloud that just doesn't exist, we still respond with HTTP status code 404 and the message "Resource not found". However, an attempt to access a cloud that you don't have permission for results in a response with status code 403 and message "No permission".
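
The difference between the two responses can be sketched as follows. This Python snippet only illustrates the rule described above and is not the API's implementation; the function and parameter names are hypothetical, and the `200, "OK"` result simply stands in for any successful response.

```python
def api_status(cloud_id, existing_clouds, user_levels, is_admin=False):
    """Return (status code, message) for a request that touches a cloud.

    existing_clouds: ids of all clouds in the account.
    user_levels: mapping of cloud id to the user's permission level.
    """
    if cloud_id not in existing_clouds:
        return 404, "Resource not found"
    if is_admin:
        return 200, "OK"  # admins always have full access
    if user_levels.get(cloud_id, "none") == "none":
        return 403, "No permission"
    return 200, "OK"


clouds = {"production", "staging"}
levels = {"production": "none", "staging": "show"}
print(api_status("missing", clouds, levels))     # (404, 'Resource not found')
print(api_status("production", clouds, levels))  # (403, 'No permission')
print(api_status("staging", clouds, levels))     # (200, 'OK')
```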

Tagging Support

2011/09/21

One lesson of the EC2/Scalarium outage was that we need to make it easier to manually interact with the instances. If Scalarium has a problem you should still be able to manage your instances.

Today we released tagging support that makes exactly this a lot simpler.

We now tag every instance. If you look at your instances on the AWS console, you can immediately tell which instance is which.

Tags on the AWS console

When you boot an instance, Scalarium will tag it on EC2 with the cloud, the instance name and the instance roles. All existing instances have been tagged too.
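
As a rough idea of what such a tag set looks like, the following Python sketch builds the tags in the Key/Value shape that the EC2 CreateTags call expects. The tag keys used here ("cloud", "name", "roles") and the example values are placeholders for illustration; the keys you see in the AWS console may be named differently.

```python
def instance_tags(cloud, name, roles):
    """Build an EC2-style tag list for one instance (tag keys are illustrative)."""
    return [
        {"Key": "cloud", "Value": cloud},
        {"Key": "name", "Value": name},
        {"Key": "roles", "Value": ", ".join(roles)},
    ]


for tag in instance_tags("production", "app1", ["rails-app", "memcached"]):
    print(tag["Key"], "=", tag["Value"])
```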

Default Datacenter (Availability Zone)

2011/08/26

Today we wanted to shine some light on a small but useful feature of Scalarium: the default datacenter (or availability zone) of a cloud.

When you create a cloud, you choose the AWS region that this cloud should live in. This defines which global EC2 installation you want to use, e.g. start your servers in Virginia (us-east-1) or in the EU (eu-west-1).

Within every AWS region there are at least two different availability zones, the actual datacenters where your servers and EBS volumes will live. When you add an instance to a cluster you can set the availability zone of the instance. Scalarium has always allowed (and encouraged) you to place your instances in multiple availability zones.

The reason why you want to spread your servers across multiple availability zones is redundancy. Should one availability zone experience a problem, you still have unaffected instances. A typical setup would be to have most of your instances in, for example, eu-west-1b for performance and cost reasons and then have one or two backup slaves in eu-west-1c or eu-west-1a.

Another setup would be to place entire clouds into a dedicated availability zone, e.g. the production cloud in us-east-1a and the backup cloud in us-east-1b.

When managing such setups you have to remember where a new instance should be placed. And this is where the default availability zone setting of a cloud comes into play. It allows you to specify exactly that: the default availability zone for new instances in this cloud.
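
In code, the rule is as simple as it sounds. This is a minimal Python sketch with made-up names, assuming that an availability zone set explicitly on the instance always wins over the cloud's default.

```python
def availability_zone_for(instance_az, cloud_default_az):
    """Pick the zone for a new instance: an explicit choice wins, else the cloud default."""
    return instance_az or cloud_default_az


# Backup cloud with a default zone of us-east-1b.
print(availability_zone_for(None, "us-east-1b"))          # us-east-1b
print(availability_zone_for("us-east-1a", "us-east-1b"))  # us-east-1a
```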

You can set the default availability zone of a cloud when editing a cloud via the "Edit Cloud" link in the actions menu.

Setting the default availability zone of a cloud

Custom EC2 Security Groups

2011/08/21

Scalarium manages the EC2 SecurityGroups for you by default.

Scalarium generates a special SecurityGroup for every built-in role and assigns it to every instance on boot. This way Scalarium can make sure that only your own application servers are allowed to talk to your MySQL server, or that the web servers are publicly reachable via port 80.

But sometimes you need more flexibility, e.g. you want to open a special port on a custom role. Previously you had to open this port on the SecurityGroup "Scalarium-Custom-Server", the general group for all custom roles.

We just deployed a feature that allows you to specify one or more custom SecurityGroups per role:

Custom SecurityGroups

Under the "Role Settings" tab of every role you can now specify a comma-separated list of EC2 SecurityGroups. Scalarium will use those when starting the instance in addition to the built-in SecurityGroups. This allows you to manage your filtering in a more fine-grained way.
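
How the comma-separated setting combines with the built-in groups can be sketched like this. The Python function below is an illustration only, not Scalarium's code; the function name and the custom group names are invented for the example.

```python
def security_groups_for(builtin_groups, custom_setting):
    """Merge the built-in groups with a comma-separated list of custom groups."""
    custom = [g.strip() for g in custom_setting.split(",") if g.strip()]
    merged = list(builtin_groups)
    for group in custom:
        if group not in merged:  # avoid passing the same group twice
            merged.append(group)
    return merged


groups = security_groups_for(["Scalarium-Custom-Server"], "redis-admin, monitoring")
print(groups)  # ['Scalarium-Custom-Server', 'redis-admin', 'monitoring']
```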

In order to create and manage the custom SecurityGroups please use the AWS console.

Please note that due to EC2 limitations the SecurityGroups of running instances cannot be changed. Also, assigning a non-existent SecurityGroup will result in a failed boot.

Please check that the created SecurityGroups belong to the same Credential and AWS Region you use in your Scalarium environment. If you use multiple regions/credentials you will have to create the SecurityGroups for each region/credential.

The Amazon EU and Scalarium outage on August 7th

2011/08/19

With the AWS EU outage aftermath fading we wanted to explain what happened last Sunday to EC2 and Scalarium. First, make sure to read the Amazon EC2 explanation. Basically, one of the EU availability zones (or datacenter in simpler terms) had a power failure and consequently all instances and EBS volumes in this datacenter went down. Inconvenient as it may be, this is something that can happen and something you should consider in your architecture.

Especially for Scalarium the requirements for redundancy and availability in such cases are very high. So we had implemented ways to cope with complete datacenter or region failure. Unfortunately several issues led to Scalarium being unavailable for several hours after the initial power failure. We are sincerely sorry for that and want to explain why this happened and what we did to make sure Scalarium will handle such failure scenarios in the future without any problems.

The Setting

Before we dig into the Scalarium architecture we need to understand the EC2 regions and availability zones better. EC2 consists of five completely separated regions (== independent EC2 installations). At the moment those regions are us-east-1 (Virginia), us-west-1 (California), eu-west-1 (Ireland), ap-southeast-1 (Singapore), and ap-northeast-1 (Tokyo). As each region is a separate EC2 installation, you cannot share or move any EC2 object like an elastic IP or EBS snapshot between regions. But this also means that any issue in the EU region will not propagate to the us-east-1 region.

Every region consists of multiple availability zones (AZ) or what could be called datacenters in an over-simplified world. The availability zones are named after their region with an added character, e.g. the region eu-west-1 has the following AZs: eu-west-1a, eu-west-1b and eu-west-1c. Availability zones have independent power supply and flood levels so that issues like a power failure should only affect one zone. Availability zones within one region share some common EC2 infrastructure like API, command nodes and routers. This allows you to migrate an elastic IP from one AZ to another or create an EBS volume in eu-west-1a out of a snapshot in eu-west-1c.

AWS EC2 Regions and Availability Zones

Traffic within one AZ is fast and free while crossing AZs or even regions adds latency and you have to pay for the traffic.

When you start instances with EC2 or Scalarium, you need to choose the AZ in which the instance should live. Once an instance is running in an AZ it is of course dependent on the availability and connectivity of the AZ in order to serve its purpose. EC2 has no built-in redundancy magic that will move an instance from one datacenter to another if one AZ experiences any problems. Having the possibility to use multiple AZs easily doesn't add any redundancy to your application. You have to actively use it by starting multiple instances in multiple datacenters and share your data across.

What is very important to know about the AZ naming is that the a, b, c etc. names are just logical names that are arbitrarily assigned per AWS account. That means that for two different AWS accounts the logical name eu-west-1a could point to a different physical datacenter. Or it could be the same. There is no official way of finding out. Amazon's reasoning behind it was that if the AZs were the same for every account, most people would end up using the 'a' zone as it is the first in the list. In order to prevent e.g. us-east-1a having 90% of the us-east-1 region customers, the names are shuffled for different accounts. Within the same AWS account the naming is of course stable and ap-northeast-1a is always the same ap-northeast-1a.
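
The effect of the per-account naming can be illustrated with a toy mapping in Python. Everything in this sketch is invented for the example: the account names, the datacenter ids and the mapping itself. As noted above, the real mapping is not visible to customers.

```python
# Hypothetical mapping of logical AZ names to physical datacenters, per account.
AZ_MAP = {
    "account-a": {"eu-west-1a": "dc-2", "eu-west-1b": "dc-1", "eu-west-1c": "dc-3"},
    "account-b": {"eu-west-1a": "dc-3", "eu-west-1b": "dc-2", "eu-west-1c": "dc-1"},
}


def same_datacenter(account_x, az_x, account_y, az_y):
    """True if two logical AZ names resolve to the same physical datacenter."""
    return AZ_MAP[account_x][az_x] == AZ_MAP[account_y][az_y]


# Different logical names, same physical datacenter:
print(same_datacenter("account-a", "eu-west-1a", "account-b", "eu-west-1b"))  # True
# Same logical name, different physical datacenters:
print(same_datacenter("account-a", "eu-west-1a", "account-b", "eu-west-1a"))  # False
```

So picking different zone letters in two accounts does not guarantee that the instances end up in different datacenters.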

Scalarium Architecture

Scalarium consists of several components. The most important ones are the frontend application servers serving https://manage.scalarium.com, the backend workers doing all the actual work like starting and stopping instances, the database masters (CouchDB and Redis) and the agent bus.

In order to provide a certain redundancy Scalarium is installed in several datacenters in multiple regions. We have one main installation in a region where the active servers and the database master are. In our case we used to run the main installation in the eu-west-1 region, let's say eu-west-1a. With the main database servers in eu-west-1a, we have replications/slaves/copies in eu-west-1b, eu-west-1c in the same region and further in other regions like us-east-1a and ap-southeast-1a.

Scalarium Architecture: Usage of regions and availability zones

In the case of a failure of the AZ where our main installation currently is or even of the complete region, we can switch to one of the backup locations as all our data is replicated there. We just need to start more workers and app servers in that AZ. In case of a complete region failure we further need to update some DNS records.

We can spawn more servers and even a complete backup installation with a "kernel" Scalarium. This separate and internal Scalarium installation controls the public Scalarium system and allows us to migrate from one AZ to another (cross-region).

So we thought we were prepared for a power outage resulting in a zone or even region failure.

Timeline

On Sunday around 7:40 PM CEST / 10:40 AM PDT we got the first alerts via SMS that some parts of Scalarium were no longer reachable. We immediately logged into some machines via SSH. We didn't notice anything unusual but investigated for several minutes. Then we abruptly lost connectivity to all our machines in our main datacenter eu-west-1a and could no longer consistently reach the API in eu-west-1. The AWS Status page didn't have any events or warnings and the EC2 forums were quiet. We checked with the AWS console, but it either returned blank reports or displayed that everything was OK, while we still couldn't reach any instance in eu-west-1a. We tweeted that we were seeing some problems on EC2 and started to prepare our options.

At 8:11 PM CEST / 11:11 AM PDT Amazon finally acknowledged that there was a problem by putting "We are investigating connectivity issues in the EU-WEST-1 region" on the AWS status page. As we were used to EC2 having temporary network issues from time to time (like any hoster at that scale), we decided not to switch to one of our backup systems immediately.

The reasoning behind it was that we were seeing connectivity problems with the complete API in eu-west-1, not only in one AZ (Amazon confirmed the region-wide API problems at 8:27 PM CEST / 11:27 AM PDT). So switching to another AZ within the EU region didn't seem safe until we had more information from Amazon. Migrating to one of our US regions meant booting up more servers there and updating DNS. As updating DNS takes some time and network issues are usually fixed within minutes, we chose to wait a bit longer with the transition.

As it became apparent that the problem was not fixed within minutes, we started to prepare the migration to our us-west-1 backup system. This is where the first unexpected issue hit us. We prepared and planned for AZ outages by putting the "kernel" Scalarium into a different AZ from the main installation. But for security and accounting reasons all Scalarium installations use different AWS accounts. Unfortunately the random logical naming of the AZs meant in our case that even though the main installation ran in eu-west-1a and the "kernel" installation in eu-west-1b, they were in fact running in the very same datacenter and thus were both down.

This meant that we could not immediately spawn more servers in a replacement AZ. It took us half an hour to re-build the "kernel" Scalarium. This could have been prevented by placing the "kernel" Scalarium into a completely different region rather than into a supposedly different AZ.

Around 8:50 PM CEST / 11:50 AM PDT we started with the migration to the us-west-1 region. We already had all our data there in the form of CouchDB and Redis slaves. With the help of the "kernel" Scalarium the remaining infrastructure was ready within minutes. Usually we would then copy the data from the slaves to new, bigger and faster master machines. Unfortunately all the slave infrastructure was running as small and medium 32 bit instances, which was our mistake and the second issue we hit. The new masters could not use the CouchDB view files pre-computed on the 32 bit slaves, so we had to re-compute them on the master nodes. We had tested and run this recovery and migration process several times on the staging systems but never hit this issue, as all staging systems were 32 bit.

The result was that we had to wait for CouchDB to re-create the views which took some time. CouchDB views are somewhat similar to MySQL indices, so the process of refreshing the views is similar to adding an index on a big table. And we had to do this for multiple views. It took the Quadruple HighMEM instances several hours to complete this. Around 5:30 AM CEST / 8:30 PM PDT Scalarium was working normally again.

Summary of Issues and Effects

Even though we expected AZ or region failures we hit two issues that prevented us from being available within minutes.

The first was that the "kernel" Scalarium was running in the same AZ as the main installation due to the random logical naming of AZs between AWS accounts.

The second problem was that the backup servers were running in 32 bit and thus their pre-computed view data could not be used by the new masters.

This resulted in the Scalarium management UI being unavailable until 5:30 AM CEST / 8:30 PM PDT. Until then Scalarium could not be used to manage our client machines. This did not affect any running instances and did not result in service disturbances. But it meant that new instances could not be started via Scalarium. As the EC2 API in eu-west-1 returned errors for some time and EBS based instances in the affected AZ had issues until Monday or in some cases even Tuesday, there was not much Scalarium could have done to help with instances in eu-west-1. People wanting to start replacement instances in another zone or region had to wait until Scalarium was restored.

Customers prepared for AZ failure, e.g. by having multiple DB instances or slaves in other AZs or regions, could recover easily, even with Scalarium affected. Customers running only in the affected AZ needed to wait for Amazon to restore their EBS data to continue operations.

What We Did To Prevent This From Happening Again

We fixed the two issues that prevented a seamless migration by putting the "kernel" Scalarium into a different AWS region and running all backup instances in 64 bit. In case of a region failure we can now switch to a replacement region quickly. All data was and is replicated to at least two additional regions and additional AZs within the same region.

We apologize for any issues our unavailability caused.

During the event we reached out to most affected customers and helped them through the outage. If you had any problems during the event you should definitely consider an architecture that factors in AZ failure and run in multiple AZs or even different regions. With Scalarium it is very easy to start additional capacity in a different AZ/region on demand and migrate your site. But this depends on you still having access to your data. A simple LAMP stack with one MySQL server will not survive an AZ outage unless you have e.g. slaves running in other AZs or replicate/backup your data regularly.

If you need any help in building a reliable architecture, don’t hesitate to contact us.