Disaster Recovery in the Cloud
Russell Collingham
SysOps
“Failures are a given and everything will eventually fail over time.”
Werner Vogels, “10 Lessons from 10 Years of Amazon Web Services”
Disaster recovery is all about preparing for failure and having a recovery plan to enact following a disaster. Any event that impacts a business in terms of time, money or reputation could be considered a disaster. When designing a software solution, a system architect needs to build for failure, and better still, for unknown failure. Failure could be small in scale, occurring in a specific hardware component such as a hard drive or network card, or large in scale, affecting a whole location, as with a natural disaster such as a flood or earthquake. The traditional approach would have been to duplicate any critical hardware infrastructure onsite and store backups offsite. This requires purchasing and maintaining more hardware and implementing effective backup processes. The cloud-based approach, on the other hand, is much more flexible. Microsoft Azure and Amazon Web Services (AWS) both provide scalable, on-demand, pay-as-you-go, multi-region platforms for handling failure.
Four typical scenarios for business continuity are:

1. Backup & restore: keeping minimal offsite backups that must be restored manually after a disaster
2. Pilot light: keeping a bare-minimum infrastructure available that has to be enabled manually
3. Warm standby: a low-capacity but fully working infrastructure on standby
4. Multi-site: a fully active, duplicated infrastructure running at all times
For the backup and restore scenario:

1. Retrieve your data backup from offsite storage
2. Rebuild the cloud infrastructure, either manually or by using CloudFormation (AWS) or Azure Resource Manager templates
3. Restore the data backup to the new platform
4. Switch over to the new system by updating the DNS records
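The rebuild step is far more reliable when the infrastructure is captured as code rather than documentation. As a minimal sketch of what that looks like (the resource name, AMI id and instance type below are hypothetical placeholders, not a working configuration), a CloudFormation template for a single web server might be:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal sketch of a rebuildable web tier (placeholder values)
Resources:
  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0abcdef1234567890   # placeholder AMI id
      InstanceType: t3.micro
```

A real template would grow to cover networking, security groups and load balancing, but the principle is the same: the entire platform can be recreated from the template in a new region after a disaster.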
From a cost perspective, this is the cheapest scenario, but the recovery time is the longest. There is also a high risk that things simply don't work as expected when the power is turned back on. Moreover, there is a heavy dependency on a reliable backup process, and the precise platform infrastructure needs to be scripted or documented so that it can be rebuilt accurately.
In the pilot light scenario, the database is mirrored by the cloud platform, thus eliminating the data retrieval and restoration time.
After a disaster, the cloud infrastructure has to be rebuilt and the DNS updated, but the downtime would be much shorter than in the backup and restore scenario. It is worth occasionally spinning up the secondary infrastructure to ensure it works as expected.
AWS (DynamoDB) and Azure (Azure Cosmos DB) both provide NoSQL databases that are replicated and available across multiple zones. For SQL databases the options are Amazon RDS (which provides MySQL, MariaDB, Oracle and PostgreSQL), the more recent Amazon Aurora (which offers higher performance and replicates data across three availability zones), and Azure SQL Database, each providing multi-zone replication and availability.
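As a hedged sketch of the pilot light's always-on database on AWS (the resource name, engine, sizes and secret path are illustrative assumptions), a CloudFormation fragment for a Multi-AZ RDS instance might look like:

```yaml
Resources:
  AppDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: postgres
      DBInstanceClass: db.t3.micro
      AllocatedStorage: "20"
      MultiAZ: true           # maintain a synchronous standby in a second AZ
      MasterUsername: appadmin   # hypothetical
      # resolve the password from Secrets Manager rather than hard-coding it
      MasterUserPassword: "{{resolve:secretsmanager:app/db:SecretString:password}}"
```

With `MultiAZ: true`, RDS keeps a synchronous standby replica in a second availability zone and fails over to it automatically, which is exactly the "mirrored database" the pilot light scenario relies on.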
The fully working, low capacity, standby scenario improves on the pilot light scenario by having a “warm standby” always running. The warm standby is a reduced capacity platform that can autoscale according to demand.
The network router would distribute a small amount (5-10%) of traffic to the secondary, low capacity, platform, providing confidence that the secondary system is fully working at all times. The primary and secondary databases would be kept in sync by the cloud platform. In the event of a disaster and the failure of the primary platform, the secondary platform would take on the full load by autoscaling. The only delay would be in the time it takes for the secondary platform to scale up to the required level to support the current load. Concerning cost, the secondary platform is always running but not at the full-scale size of the primary platform.
The biggest advantage of this fully working low capacity scenario is that the response to a disaster is fully automated, with no manual setup or intervention required. The network router should detect that the primary platform is unresponsive and immediately route all traffic to the secondary platform which will then autoscale to full capacity.
On AWS, Route 53 handles DNS and traffic flow. On Azure, this is handled by Azure DNS and Azure Traffic Manager. Both of these let you set a specific traffic policy to handle the various scenarios you wish to implement.
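As an illustration of the 5-10% weighting described above, a Route 53 change batch could define two weighted records, with a health check attached to the primary so traffic shifts automatically if it fails. The domain, addresses (TEST-NET range) and health check id below are hypothetical:

```json
{
  "Comment": "Weighted records: ~90% primary, ~10% warm standby",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com.",
        "Type": "A",
        "SetIdentifier": "primary",
        "Weight": 90,
        "TTL": 60,
        "HealthCheckId": "abcdef11-2222-3333-4444-555555fedcba",
        "ResourceRecords": [{ "Value": "203.0.113.10" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com.",
        "Type": "A",
        "SetIdentifier": "standby",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "203.0.113.20" }]
      }
    }
  ]
}
```

A file like this would be applied with `aws route53 change-resource-record-sets` against the hosted zone; the low TTL keeps failover time short.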
For server-based solutions, AWS provides EC2 instances, and Azure provides Azure Virtual Machines. Both providers have an auto-scaling solution to spin up more servers according to demand. For serverless solutions, AWS Lambda and Azure Functions both utilise API interfaces (AWS API Gateway and Azure API Management) that scale automatically according to demand without having to worry about provisioning servers.
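A sketch of the warm standby's compute tier in CloudFormation would pair a small always-on Auto Scaling group with a target-tracking policy, so the standby can grow to full capacity on its own when the primary fails. The subnet and launch template ids here are placeholders:

```yaml
Resources:
  StandbyGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"            # warm standby floor: always running, small
      MaxSize: "10"           # full-capacity ceiling after failover
      DesiredCapacity: "1"
      VPCZoneIdentifier:
        - subnet-0abc1234     # placeholder subnet id
      LaunchTemplate:
        LaunchTemplateId: lt-0abc1234   # placeholder launch template id
        Version: "1"
  ScaleOnCpu:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref StandbyGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60     # add instances when average CPU exceeds 60%
```

When all production traffic lands on the standby, average CPU spikes and the group scales out automatically; the only delay is the instance boot time.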
The final scenario is a multi-site fully active scenario. Both Azure and AWS can provide cloud services in many regions (such as Europe, Asia Pacific and the USA). Within each region there are multiple grouped availability zones (for example, the London region in Europe has three AZs).
The multi-site scenario offers additional disaster protection by fully replicating the platform in two availability zones. Keeping both sites at full capacity means virtually nil downtime: the network router directs traffic to both sites continuously and, if one platform fails, routes all traffic to the remaining site.
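On the Azure side, the active-active routing described above could be sketched as a Traffic Manager profile using performance-based routing with health monitoring, so users are sent to their nearest healthy site and an unhealthy site is dropped automatically. This is an illustrative fragment, not a complete deployment; the profile name, endpoints, targets and health path are assumptions:

```json
{
  "type": "Microsoft.Network/trafficManagerProfiles",
  "apiVersion": "2018-08-01",
  "name": "app-dr-profile",
  "location": "global",
  "properties": {
    "trafficRoutingMethod": "Performance",
    "dnsConfig": { "relativeName": "app-dr", "ttl": 60 },
    "monitorConfig": { "protocol": "HTTPS", "port": 443, "path": "/health" },
    "endpoints": [
      {
        "name": "site-a",
        "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
        "properties": { "target": "site-a.example.com", "endpointLocation": "westeurope" }
      },
      {
        "name": "site-b",
        "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
        "properties": { "target": "site-b.example.com", "endpointLocation": "northeurope" }
      }
    ]
  }
}
```

Traffic Manager probes each endpoint's health path; while both sites are healthy, traffic is shared, and if one stops responding all traffic flows to the survivor.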
This is the most expensive scenario as the entire full capacity platform is replicated across two geographically separate locations, with both platforms up and running 24/7.
Minor variations can be made to each scenario: mirroring the database to a completely different region, or to a separate, locked-off, read-only cloud account entirely; using a different region as the secondary site; or hybrids, such as keeping the low-capacity standby in a different availability zone.
Ultimately, your choice of disaster recovery scenario depends on two things: your budget and how long you are prepared to wait while your business is offline. Both Microsoft Azure and AWS provide a wide range of features to assist in recovering from failure, but these come at a financial cost. Choose the most appropriate scenario for your budget and your application.
What does your disaster recovery plan look like? Let us know in the comments below or over on Twitter, @hedgehoglab.