You've meticulously followed best practices for your AWS cloud setup: your servers are highly available, spread across multiple data centers (availability zones in AWS), most of your applications are stateless and have more than one replica, and you have traffic incoming from a load balancer targeting the different server types you've set up. You feel secure and ready to handle more customers.
However, during an executive meeting, someone asks:
"Last year, region X went down for more than Y hours for service D. Would our customers be affected if it occurs again?"
The question is usually asked at a very high level, not about your particular setup and its underlying components.
So you mumble a few things about being HA, the business SLA, downtime being low, being able to recover fast, and how amazing life has been since you were never down before—all while sweating. You haven't answered, but you hope their faith in you is enough. But deep down, you know you won't sleep as well as before.
Let's break down the question into:
What happened last year?
Why does it matter?
Should you plan for it?
What happened last year?
Last year, your business suffered the effects of a disaster. A disaster usually meets the following criteria: widespread impact, unpredictability, and a need for substantial recovery efforts.
For instance, in 2020, the AWS US-East-1 region experienced an outage that affected multiple services, leading to widespread disruption for businesses that relied on this region for their cloud infrastructure.
Here are typical examples of disasters that can affect your cloud infrastructure:
Natural disasters such as hurricanes, earthquakes, or floods.
Human-made disasters such as cyberattacks, terrorist attacks, or power outages.
The impact goes beyond the infrastructure, affecting your teams, customers, and even shareholders.
Why does it matter?
While it's easy to dismiss the possibility of a disaster affecting your business, especially if you haven't experienced any issues in the past, it's important to understand the potential consequences.
"We never had any issue in the past years... so why would I care about a low-odds event?"
"Because what can happen, will happen." -- say the smart guys.
On a serious note, the impact of a disaster can be far-reaching, affecting not only your infrastructure but also your teams, customers, and even shareholders.
For instance, a small online retailer that relies on AWS for hosting its website and managing its inventory could be severely impacted if there is an outage. Even a few hours of downtime during a busy shopping season could result in thousands of dollars in lost sales, not to mention the potential damage to the company's reputation if customers are unable to access the site or complete their purchases.
Moreover, in the age of social media, news of your downtime can spread quickly, leading to a loss of customer trust and confidence that can have long-lasting effects.
Here are examples of previous disasters that affected different companies, and the effects they had on their teams, customers, and shareholders:
In 2017, a major airline had a global IT failure that stranded 75,000 passengers, costing the company an estimated $100 million and damaging its reputation.
A small e-commerce business experienced an outage during Black Friday, resulting in a loss of sales and a damaged reputation as frustrated customers took to social media to express their dissatisfaction.
Do you want to be one of them? Probably not. You may think you are too small to worry about this, but as the examples show, businesses of all sizes can be affected.
And remember that:
This goes beyond the infra... affecting your teams, customers, and even shareholders
Should you plan for it?
But what about the cost? When I hear 'cost,' I break it down into two types: monetary cost and reputation cost. Are you willing to accept a poorer reputation among your shareholders and users in order to save money?
Or would you rather spend money to keep or improve your reputation? It's important to understand that spending money on disaster recovery is not just an expense; it's an investment in your business's resilience and reputation. In this sense, it's similar to purchasing insurance.
Just as you would insure your business's physical assets against fire, theft, or other disasters, it's crucial to invest in a disaster recovery plan to protect your digital assets and ensure business continuity.
This investment can help you avoid the potentially catastrophic costs associated with downtime, data loss, and damaged reputation.
So should I get an insurance broker on the phone now?
Not yet. First, let's understand what we need to know.
Recovering from a disaster is possible, and how you do it should be laid out in a disaster recovery plan. Such a plan includes:
Risk Assessment: Identify the critical parts of your business and the risks associated with them.
Recovery Strategies: Determine the best strategies to recover your critical business functions.
Plan Development: Develop a comprehensive recovery plan, including roles, responsibilities, and actions to be taken before, during, and after a disaster.
Testing and Maintenance: Regularly test and update the plan to ensure it remains effective.
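Back to AWS: to put a good plan in place, you need to know certain metrics in advance. The two most important are the RPO (Recovery Point Objective), the maximum amount of data, measured in time, that you can afford to lose, and the RTO (Recovery Time Objective), the maximum time you can afford to be down before service is restored. With these two numbers, you can work out which strategy makes sense.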
We will not deep dive into each strategy today, but here is an interesting picture of what your possibilities are. We will get to know each of them soon in another post 😉
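As a rough illustration of how RPO and RTO can narrow down a strategy, here is a minimal Python sketch. The four strategy names are the ones AWS commonly describes (backup and restore, pilot light, warm standby, multi-site active/active). Everything else is an assumption made for this example: the `RecoveryObjectives` class, the `suggest_strategy` function, the cut-off values, and the shortcut of looking only at the tighter of the two objectives. Treat it as a conversation starter, not as AWS guidance.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the targets agreed with the business."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time until service is restored


def suggest_strategy(objectives: RecoveryObjectives) -> str:
    """Return a candidate DR strategy for the given objectives.

    The cut-off values below are illustrative assumptions only;
    real thresholds depend on your workload and budget.
    """
    # Simplification: let the tighter of the two objectives drive the choice.
    tightest = min(objectives.rpo, objectives.rto)
    if tightest <= timedelta(minutes=1):
        return "multi-site active/active"
    if tightest <= timedelta(minutes=15):
        return "warm standby"
    if tightest <= timedelta(hours=1):
        return "pilot light"
    return "backup and restore"


if __name__ == "__main__":
    # Hypothetical workloads with made-up objectives.
    checkout = RecoveryObjectives(rpo=timedelta(minutes=5), rto=timedelta(minutes=10))
    reporting = RecoveryObjectives(rpo=timedelta(hours=24), rto=timedelta(hours=8))
    print(suggest_strategy(checkout))   # warm standby
    print(suggest_strategy(reporting))  # backup and restore
```

The point of the sketch is the direction of the trade-off: the tighter the objectives, the more infrastructure has to be running ahead of time, and the more the strategy costs.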
Anyhow, next time you are asked that question in such a meeting, you will be able to educate your audience on being highly available and having a plan in case a disaster occurs.