Creating a disaster recovery plan for AWS

Disaster recovery on AWS

When we create resources in AWS, we aim to make them as fault tolerant as possible and always keep disaster recovery (DR) provisions in mind. Before we look at the kinds of DR provisions we can implement, it’s important to understand how we measure a DR plan.

The two measures of a DR strategy are RTO (Recovery Time Objective) and RPO (Recovery Point Objective). RTO is the time it takes after a disruption to restore operations to an acceptable level; that is, how long it will take before your systems and data are available again.

RPO is the point in time to which we wish to restore the data; in other words, the maximum amount of data loss, measured in time, that the business can tolerate. For example, let’s say that a disaster struck on Saturday afternoon at 2pm. What point will you restore to? When was your last backup taken? Some businesses have little tolerance for data loss and may need to restore everything, right up to items created 15 minutes ago. Others might be able to tolerate one day of data loss. These objectives are defined by the business; as technologists, we need to find the best way to hit them.
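To make the two measures concrete, here is a minimal sketch of the arithmetic, assuming a made-up timeline for the Saturday 2pm example. The function names, the timestamps and the two objectives are all hypothetical and only illustrate how an achieved figure is compared against its target.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_backup: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: anything written after the last backup is lost."""
    return disaster - last_backup


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Outage window: time between the disaster and service being usable again."""
    return service_restored - disaster


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved window does not exceed it."""
    return achieved <= objective


# Hypothetical timeline (1 June 2024 was a Saturday).
disaster = datetime(2024, 6, 1, 14, 0)      # disaster strikes at 2pm
last_backup = datetime(2024, 6, 1, 2, 0)    # assumed nightly backup at 2am
restored = datetime(2024, 6, 1, 14, 45)     # assumed time service came back

rpo = achieved_rpo(last_backup, disaster)   # 12 hours of data lost
rto = achieved_rto(disaster, restored)      # 45 minutes of outage

# A business that tolerates one day of data loss is fine with a nightly backup;
# one that tolerates only 15 minutes is not.
ok_for_one_day = meets_objective(rpo, timedelta(days=1))
ok_for_15_minutes = meets_objective(rpo, timedelta(minutes=15))
```

With these assumed figures, a nightly backup satisfies a one-day RPO but falls well short of a 15-minute one, which would need something closer to continuous replication.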

In AWS, we’ll look at three options for DR. The first is referred to as a pilot light. This is where we have a scaled down version of the live environment sitting on standby in a different region (i.e. the instances are not running), with data replicated from live. When there is an issue with the live environment, we swing the DNS to point at the pilot light and scale out to handle production level load. This option is relatively low cost, as the instances are not running and therefore not incurring compute costs. However, instance launch and DNS changes can take some time, so you may experience a few minutes of outage. Additionally, auto scaling can take a while to catch up with a sudden increase in load, which may result in poor system performance for a period of time.
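The DNS swing could be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: it assumes the record is a plain CNAME hosted in Route 53, and the function name, record name and DR endpoint are hypothetical. The function only builds the change request, so it runs without AWS credentials.

```python
def build_failover_change(record_name: str, dr_endpoint: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that repoints a CNAME at the DR endpoint.

    UPSERT creates the record if it is missing and overwrites it otherwise.
    A low TTL shortens the time resolvers keep serving the old answer.
    """
    return {
        "Comment": "Swing traffic to the pilot light environment",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": dr_endpoint}],
                },
            }
        ],
    }


# Hypothetical names, for illustration only.
change = build_failover_change("www.example.com.", "dr-lb.example.net.")
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of your zone. Starting the stopped instances and scaling out would be separate steps.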

Another option is to run a warm standby, which is a slightly more advanced version of the ‘pilot light’ concept we just discussed. In this option, we run a mid-sized fleet of EC2 instances. As they stand, these servers are not capable of handling production load, but they can be ramped up in the event of a disaster. Where the standby sits in the same region as live, we can remap an Elastic IP to our DR instances to enable near-instant recovery of the system, with almost no downtime. Elastic IPs cannot be moved between regions, so a cross-region standby relies on a DNS change instead.

Finally, we have a multi-site deployment. This is where we have two identical environments running in different AWS regions. Traffic is load balanced between the two (perhaps with geo-based routing). In the event of a disaster, the affected region is taken out of service while the other continues to serve traffic. This results in no downtime for end users.
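As a rough model of that behaviour, the sketch below simulates weighted routing across two regions and drops a region once it is marked unhealthy. It is a toy simulation, not how any AWS routing service is implemented; the function name, region names and weights are assumptions chosen for illustration.

```python
import random
from typing import Dict, Set


def pick_region(weights: Dict[str, int], healthy: Set[str], rng: random.Random) -> str:
    """Choose a region for one request, in proportion to its weight.

    Regions that are unhealthy (or have zero weight) receive no traffic.
    """
    candidates = {r: w for r, w in weights.items() if r in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy region available")
    regions = list(candidates)
    return rng.choices(regions, weights=[candidates[r] for r in regions], k=1)[0]


# Hypothetical setup: two identical environments sharing traffic equally.
weights = {"eu-west-1": 50, "us-east-1": 50}
rng = random.Random(0)  # seeded so the simulation is repeatable

# Normal operation: requests land in either region.
normal = [pick_region(weights, {"eu-west-1", "us-east-1"}, rng) for _ in range(100)]

# Disaster in us-east-1: every request goes to the surviving region.
during_outage = [pick_region(weights, {"eu-west-1"}, rng) for _ in range(100)]
```

The point of the model is that no failover step is needed: the surviving region is already serving traffic, so it simply takes all of it. In practice it also has to be sized, or able to scale, to carry the full load on its own.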

Image used under creative commons
