Replacing Failed Servers
Servers fail. Cloud servers don't fail more often than bare metal ones, but in the cloud we can automate the response and immediately replace dead machines with new ones.
Detect and React
Every instance has an agent running that maintains an encrypted connection to Scalarium. This connection is used to control the instance.
If this connection dies or times out, Scalarium notices and starts to observe the instance. If there is no immediate re-connect, Scalarium marks this instance as "offline".
The remaining instances will re-configure themselves and exclude this instance. As soon as the instance re-connects, the cluster will re-configure itself again to include it.
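The detection steps above can be pictured as a small state machine. The following Python sketch is only an illustration and not Scalarium's actual implementation: the class and function names, the intermediate "observed" state, and the 60-second grace period are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    ONLINE = "online"
    OBSERVED = "observed"   # connection lost, waiting for a re-connect
    OFFLINE = "offline"


@dataclass
class InstanceMonitor:
    """Hypothetical tracker for one instance's agent connection."""
    grace_seconds: float = 60.0      # assumed value, not taken from Scalarium
    state: State = State.ONLINE
    lost_at: Optional[float] = None

    def connection_lost(self, now: float) -> None:
        # The connection died or timed out: start observing the instance.
        if self.state is State.ONLINE:
            self.state = State.OBSERVED
            self.lost_at = now

    def reconnected(self) -> None:
        # The agent is back: the instance counts as online again.
        self.state = State.ONLINE
        self.lost_at = None

    def tick(self, now: float) -> None:
        # No re-connect within the grace period: mark the instance offline.
        if (self.state is State.OBSERVED
                and self.lost_at is not None
                and now - self.lost_at >= self.grace_seconds):
            self.state = State.OFFLINE


def active_members(monitors: Dict[str, InstanceMonitor]) -> List[str]:
    """Names of the instances the cluster should currently include."""
    return sorted(name for name, m in monitors.items()
                  if m.state is not State.OFFLINE)
```

In this sketch the cluster would be re-configured whenever the result of active_members changes, which covers both excluding an offline instance and including it again after a re-connect.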


Heal
If the instance does not re-connect, Scalarium will restore the cluster to its previous state by stopping and starting the instance.
Effectively this will boot a new EC2 instance with the same configuration and roles as the dead instance. As soon as the new instance is ready, the cluster will be notified.
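As a rough sketch of this step, the Python code below builds a replacement that copies the configuration and roles of the dead instance, boots it, and then notifies the cluster. Everything here is hypothetical: the Instance fields, the heal function, and the boot and notify callbacks stand in for whatever Scalarium and EC2 do internally.

```python
import itertools
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Instance:
    """Hypothetical description of an instance in the cluster."""
    instance_id: str
    instance_type: str
    roles: Tuple[str, ...]


# Simple counter used to give replacements a new id in this sketch.
_replacement_ids = itertools.count(1)


def heal(dead: Instance,
         boot: Callable[[Instance], None],
         notify: Callable[[Instance], None]) -> Instance:
    """Replace a dead instance with a fresh one of the same shape."""
    # Same configuration and roles, but a new underlying machine.
    fresh = replace(dead, instance_id=f"replacement-{next(_replacement_ids)}")
    boot(fresh)     # assumed to return once the new instance is ready
    notify(fresh)   # tell the cluster so it can re-configure itself
    return fresh
```

A caller would pass in its own boot and notify functions, for example heal(dead_instance, boot=start_machine, notify=reconfigure_cluster), where start_machine and reconfigure_cluster are placeholders for real provisioning and configuration code.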
This automatic "healing" works great for stateless servers like load balancers or application servers.
If you have more than one of those running, there should be no downtime or only a very short one.
Self-healing can also be enabled for database servers, but replacing those usually still causes downtime.