Auto Healing
Overview
Auto Healing makes sure that once you have defined the available services and hosts, the cluster automatically recovers from failures. Should one of your servers fail, Scalarium will remove it and set up a replacement instance within minutes, updating the cluster accordingly.
Detect and React
Every instance runs an agent that maintains an encrypted connection to Scalarium. This connection is used to control the instance. If the connection dies or times out, Scalarium notices and starts to observe the instance. If the instance does not re-connect right away, Scalarium marks it as "offline".
The remaining instances will re-configure themselves to exclude the offline instance. As soon as the instance re-connects, the cluster will re-configure itself again to include it.
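The detection flow above can be pictured as a small state machine. The following is only a sketch under stated assumptions, not Scalarium's implementation: the state names, the function names, and the 30-second grace period are all hypothetical, since the text does not say how long Scalarium waits before marking an instance offline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    ONLINE = "online"
    OBSERVED = "observed"   # connection lost, waiting for a re-connect
    OFFLINE = "offline"


# Hypothetical grace period; the real value is not documented here.
RECONNECT_GRACE_SECONDS = 30.0


@dataclass
class Instance:
    name: str
    state: State = State.ONLINE
    lost_at: Optional[float] = None


def on_connection_lost(inst: Instance, now: float) -> None:
    """The agent connection died or timed out: start observing."""
    if inst.state is State.ONLINE:
        inst.state = State.OBSERVED
        inst.lost_at = now


def on_tick(inst: Instance, now: float) -> None:
    """Mark an observed instance offline once the grace period has passed."""
    if (inst.state is State.OBSERVED and inst.lost_at is not None
            and now - inst.lost_at >= RECONNECT_GRACE_SECONDS):
        inst.state = State.OFFLINE


def on_reconnect(inst: Instance) -> None:
    """The agent re-connected: the instance rejoins the cluster."""
    inst.state = State.ONLINE
    inst.lost_at = None


def active_members(instances: List[Instance]) -> List[str]:
    """Names the remaining instances should include when they re-configure."""
    return [i.name for i in instances if i.state is not State.OFFLINE]
```

In this sketch, only instances marked "offline" are dropped from the membership list, so a brief connection hiccup that recovers within the grace period does not trigger a re-configuration.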
Heal
If the instance does not re-connect after some time, Scalarium will restore the cluster to its previous state by stopping and starting the instance. Effectively, this boots a new EC2 instance with the same configuration and roles as the failed instance. As soon as the new instance is ready, the cluster will be notified.
This automatic "healing" works for load balancers, Rails application servers, PHP application servers, and static web servers. The resulting downtime depends on how many instances are left in the cluster and what kind of instance died. Database servers will not be restored automatically, because there is always a chance of data corruption. Manual intervention is recommended for them.
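The healing rules can be summarized as a decision by role. This is an illustrative sketch only: the role identifiers, the function name, and the 300-second threshold are assumptions made for the example, and treating an instance with any non-healable role (such as a database) as a manual case is a conservative guess, not documented behavior.

```python
from typing import Iterable

# Hypothetical identifiers for the roles the text lists as auto-healed.
AUTO_HEALED_ROLES = {"load_balancer", "rails_app", "php_app", "static_web"}

# Hypothetical threshold; the text only says "after some time".
HEAL_AFTER_SECONDS = 300.0


def healing_action(roles: Iterable[str], offline_seconds: float) -> str:
    """Return "wait", "replace", or "manual" for an offline instance."""
    if offline_seconds < HEAL_AFTER_SECONDS:
        return "wait"      # the instance may still re-connect on its own
    role_set = set(roles)
    if role_set and role_set <= AUTO_HEALED_ROLES:
        return "replace"   # stop and start: boots a new EC2 instance
    return "manual"        # e.g. database servers: risk of data corruption


print(healing_action(["rails_app"], 600.0))
print(healing_action(["database"], 600.0))
```

With these assumed values, the first call yields "replace" and the second yields "manual".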
