When architecting in AWS, there are a number of best practices you can follow to ensure that your applications are highly available and fault tolerant.
Firstly, all applications that are intended to be highly available and fault tolerant should be designed for failure. This means you should design them on the assumption that components will fail, with the goal that your users experience minimal or no downtime.
To do this, we should use Auto Scaling groups and Elastic Load Balancers to make our environment ‘self-healing’. Remember, the Elastic Load Balancer will stop sending traffic to an unhealthy instance, and the Auto Scaling group will replace an unhealthy instance once it is detected.
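The self-healing behaviour described above can be illustrated with a small simulation. This is only a sketch in plain Python, not AWS code: the `Instance` and `AutoScalingGroup` classes and the `routable_targets` function are hypothetical stand-ins for an EC2 instance, an Auto Scaling group and the load balancer's health-based routing.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


@dataclass
class AutoScalingGroup:
    """Toy model of an Auto Scaling group (hypothetical, not the AWS API)."""
    desired_capacity: int
    instances: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def launch(self):
        # Launch a new instance with the next sequential ID.
        instance = Instance(f"i-{next(self._ids):04d}")
        self.instances.append(instance)
        return instance

    def reconcile(self):
        # Drop unhealthy instances, then launch replacements
        # until the group is back at its desired capacity.
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired_capacity:
            self.launch()


def routable_targets(instances):
    # The load balancer only sends traffic to healthy instances.
    return [i for i in instances if i.healthy]


asg = AutoScalingGroup(desired_capacity=2)
asg.reconcile()                      # launches two instances
asg.instances[0].healthy = False     # simulate a failure
in_service = routable_targets(asg.instances)
asg.reconcile()                      # the failed instance is replaced
```

Between the failure and the next `reconcile()` call, traffic is served by the remaining healthy instance only, which is why a single-instance group would still see downtime.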
This self-healing design should always be deployed into a minimum of two Availability Zones. However, as AWS does not guarantee On-Demand capacity in any given Availability Zone, it is recommended that you purchase Reserved Instances in both zones, sized so that either zone alone can support your web application. Reserved Instances scoped to a specific Availability Zone carry a capacity reservation in that zone. While it would not usually be a problem to spin up On-Demand instances in AWS, imagine that Availability Zone 1 has a problem and is completely unreachable. Every user with a deployment in Availability Zone 1 will be rushing to deploy instances in one of the alternative Availability Zones, which may result in a shortage of On-Demand resources.
This does not apply only to EC2 instances. We should also always enable Multi-AZ deployments of RDS, which keep a standby copy of the database in a separate Availability Zone, and enable automated backups.
Further to the self-healing concept, we should also decouple our application using SQS so that certain parts of the application can continue to work in isolation from the issues other application components may be experiencing.
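To make the decoupling idea concrete, here is a minimal sketch that uses Python's standard-library `queue.Queue` as a stand-in for an SQS queue. The function names `place_order` and `process_pending` are hypothetical, and a real SQS integration would add visibility timeouts, retries and message deletion, none of which are modelled here.

```python
import queue

order_queue = queue.Queue()  # stand-in for an SQS queue


def place_order(order_id):
    # Front end: enqueue the work and return straight away,
    # regardless of whether the back end is currently healthy.
    order_queue.put({"order_id": order_id})
    return "accepted"


def process_pending(worker_available):
    # Back end: drain the queue only while a worker is available.
    processed = []
    if not worker_available:
        return processed
    while True:
        try:
            message = order_queue.get_nowait()
        except queue.Empty:
            break
        processed.append(message["order_id"])
    return processed


statuses = [place_order(n) for n in (1, 2, 3)]           # back end is down
during_outage = process_pending(worker_available=False)  # nothing processed
after_recovery = process_pending(worker_available=True)  # backlog is drained
```

The front end keeps accepting orders throughout the outage, and the backlog is worked off once the consumer recovers.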
To ensure the highest availability, we should enable latency-based or failover routing in Route 53. Then, if our primary endpoint behind the ELB were to go down completely, we could fail over to an S3-hosted static site.
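As a rough illustration of failover routing, the sketch below builds a pair of PRIMARY and SECONDARY alias records in the dictionary shape that, to my understanding, boto3's Route 53 `change_resource_record_sets` call accepts as its `ChangeBatch` argument. Nothing is sent to AWS here. The function name `failover_change_batch`, the domain, and every DNS name and hosted zone ID are hypothetical placeholders.

```python
def failover_change_batch(domain, primary, secondary):
    """Build a Route 53 change batch with a PRIMARY and a SECONDARY record.

    `primary` and `secondary` are (dns_name, alias_hosted_zone_id) tuples.
    """
    def record(role, target):
        dns_name, zone_id = target
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                "SetIdentifier": role.lower(),
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "AliasTarget": {
                    "HostedZoneId": zone_id,
                    "DNSName": dns_name,
                    # Let Route 53 use the target's health to decide failover.
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {
        "Comment": "Fail over from the ELB to a static S3 site",
        "Changes": [record("PRIMARY", primary), record("SECONDARY", secondary)],
    }


# All values below are placeholders, not real endpoints or zone IDs.
batch = failover_change_batch(
    "www.example.com.",
    primary=("my-elb.example-elb-endpoint.com.", "ZELBPLACEHOLDER"),
    secondary=("s3-website.example-s3-endpoint.com.", "ZS3PLACEHOLDER"),
)
```

In practice the secondary record would point at the S3 website endpoint for the bucket's region, using the alias hosted zone ID that AWS publishes for that endpoint.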
For disaster recovery, we must ensure that our AMIs and snapshots are copied to multiple regions to protect against a major disaster in any one region.
Caching static content in CloudFront can enable us to serve cached content to users while we’re experiencing issues with the origin. This lets us provide a seemingly seamless user experience while the backend issues are resolved.
We should enable CloudWatch monitoring and alarms to monitor the environment and, when used in conjunction with SNS, to be notified of issues.
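The alarm-plus-notification pattern can be sketched as follows. This is a simplified model, not CloudWatch's actual evaluation logic: it assumes an alarm fires only when the most recent `evaluation_periods` datapoints all exceed the threshold. The state names match CloudWatch's, but the functions `alarm_state` and `evaluate_and_notify` are hypothetical, and the `notify` callback stands in for publishing to an SNS topic.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    # Not enough data yet to make a decision.
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Alarm only if every recent datapoint breaches the threshold.
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


def evaluate_and_notify(datapoints, threshold, evaluation_periods, notify):
    state = alarm_state(datapoints, threshold, evaluation_periods)
    if state == "ALARM":
        # In AWS this would be an SNS notification to subscribers.
        notify(f"CPU above {threshold}% for {evaluation_periods} consecutive periods")
    return state


sent = []  # collects the messages that would go out via SNS
state = evaluate_and_notify(
    [40, 85, 90, 95], threshold=80, evaluation_periods=3, notify=sent.append
)
```

Requiring several consecutive breaches, rather than one, is a common way to avoid being paged for a brief spike.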
If you have instances sitting inside a private subnet, you should use a bastion host to connect to them.
Finally, deploying the correct scaling options helps us maintain system availability and performance. The types of scaling at our disposal are:
- Proactive cycle scaling: where you scale your environment at fixed intervals, e.g. for the 8 AM rush each day
- Proactive event-based scaling: where you scale your environment in anticipation of a big event (such as a launch event)
- Auto Scaling: on-demand scaling, where capacity is adjusted automatically in response to actual load
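
The three scaling types above can be combined, as in this minimal sketch. All of the numbers are assumptions chosen for illustration (a baseline of 2 instances, a 7 to 10 AM rush window, a launch event on a made-up date, and 70%/30% CPU thresholds), and the function names are hypothetical rather than part of any AWS API.

```python
from datetime import datetime

BASELINE = 2                  # assumed minimum fleet size
MORNING_RUSH_CAPACITY = 6     # assumed capacity for the daily rush
LAUNCH_EVENT_CAPACITY = 10    # assumed capacity for a one-off event
LAUNCH_EVENT = (datetime(2024, 6, 1, 12), datetime(2024, 6, 1, 18))


def cycle_floor(now):
    # Proactive cycle scaling: scale up ahead of the 8 AM rush every day.
    return MORNING_RUSH_CAPACITY if 7 <= now.hour < 10 else BASELINE


def event_floor(now):
    # Proactive event-based scaling: scale up for a known event window.
    start, end = LAUNCH_EVENT
    return LAUNCH_EVENT_CAPACITY if start <= now < end else BASELINE


def on_demand_target(current, cpu_percent):
    # On-demand scaling: react to the load actually being observed.
    if cpu_percent > 70:
        return current + 1
    if cpu_percent < 30:
        return max(BASELINE, current - 1)
    return current


def desired_capacity(now, current, cpu_percent):
    # Take the largest requirement from the three strategies.
    return max(
        cycle_floor(now),
        event_floor(now),
        on_demand_target(current, cpu_percent),
    )


rush = desired_capacity(datetime(2024, 5, 30, 8), current=2, cpu_percent=50)
launch = desired_capacity(datetime(2024, 6, 1, 13), current=6, cpu_percent=50)
spike = desired_capacity(datetime(2024, 5, 30, 15), current=3, cpu_percent=85)
```

Taking the maximum means the proactive schedules act as floors, while on-demand scaling can still add capacity on top of them when load is higher than anticipated.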