Which DR strategy for Kubernetes?

gmolaire
Sep 11, 2023
Updated: Nov 7, 2023

I was recently asked:

What is the DR strategy you have implemented for your Kubernetes clusters?

And my answer is always the same: Go light, warm or go home. Although it will depend on the budget, RPO and RTO of the underlying business, most of the time pilot light or warm standby is enough.

So what are you referring to exactly?

Let's go back to how AWS frames DR. There are four main options, roughly in order of increasing cost and decreasing recovery time: Backup & Restore, Pilot Light, Warm Standby, and Active-Active.

To understand each option, we need the context of a disaster. Let's say the disaster is as simple as this: the region in which your Kubernetes cluster is running is down. Now what?

Backup & Restore

This approach involves periodically backing up your cluster's state: data, deployment configurations, and other relevant settings. On AWS, you could use EBS snapshots for volume data, Amazon S3 for configuration backups, and ECR for container images, each copied or replicated to a second region so the backups survive a regional outage.

Example with EKS: Assuming your EKS cluster data resides in EBS volumes, you can take regular snapshots of these volumes (AWS stores them in S3 behind the scenes) and copy them to another region. If the region in which your EKS cluster is operating goes down, you'd first set up a new EKS cluster in another region, then restore the volumes from the copied snapshots. You'd also have to redeploy your applications using the configurations and container images you've backed up.
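Snapshots only help in a regional outage if copies exist outside the affected region. Here is a minimal sketch of that copy step, assuming boto3 is used and that dest_ec2 is an EC2 client created for the DR region. The function name, the region names and the snapshot IDs are illustrative assumptions, not something prescribed by AWS.

```python
from typing import Any, Dict, List


def copy_snapshots_to_dr_region(
    dest_ec2: Any,
    source_region: str,
    snapshot_ids: List[str],
) -> Dict[str, str]:
    """Copy EBS snapshots into the region that dest_ec2 is bound to.

    dest_ec2 is assumed to behave like a boto3 EC2 client for the DR region,
    for example boto3.client("ec2", region_name="us-west-2").
    Returns a mapping of source snapshot ID to the new snapshot ID.
    """
    copied: Dict[str, str] = {}
    for snapshot_id in snapshot_ids:
        # copy_snapshot is called on the destination-region client and
        # pulls the snapshot from the source region.
        response = dest_ec2.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"DR copy of {snapshot_id} from {source_region}",
        )
        copied[snapshot_id] = response["SnapshotId"]
    return copied
```

In practice you would run something like this on a schedule (or let a managed backup service do it), and the frequency of the copies effectively sets your RPO for this strategy.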

Pilot Light

For EKS, this would mean having a minimal footprint of your cluster always running in another region. Essential services, especially the data layer, are replicated.

Example with EKS: You'd have a smaller EKS cluster running in another region, with critical services such as databases kept in sync through replication. Amazon RDS, for instance, supports cross-region read replicas. If your main cluster goes down, you'd promote the replica, scale this cluster up, redirect traffic (perhaps using Route 53 with health checks), and it would take over.
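The "scale this cluster up" step could look roughly like the sketch below. It assumes the pilot light cluster uses EKS managed node groups and that eks is a boto3 EKS client for the DR region. The function name, the cluster and node group names and the sizes are hypothetical.

```python
from typing import Any, Dict


def scale_up_pilot_light(
    eks: Any,
    cluster_name: str,
    nodegroup_name: str,
    min_size: int,
    desired_size: int,
    max_size: int,
) -> Dict[str, int]:
    """Raise a managed node group from pilot-light size to production size.

    eks is assumed to behave like boto3.client("eks", region_name=<DR region>).
    Returns the scaling configuration that was requested.
    """
    if not min_size <= desired_size <= max_size:
        raise ValueError("expected min_size <= desired_size <= max_size")

    scaling = {
        "minSize": min_size,
        "maxSize": max_size,
        "desiredSize": desired_size,
    }
    eks.update_nodegroup_config(
        clusterName=cluster_name,
        nodegroupName=nodegroup_name,
        scalingConfig=scaling,
    )
    return scaling
```

Scaling the nodes is only one part of the failover. The database replica still has to be promoted and traffic still has to be redirected, so the time these steps take together is what you should measure against your RTO.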

Warm Standby

With EKS, this would be like having a fully functional clone of your production environment but scaled down.

Example with EKS: In a different region, you'd have an EKS cluster with the same configurations as your main cluster but scaled down. Key services would be in standby or running in scaled-down versions. In case of a disaster in the primary region, this cluster can quickly scale to handle the full production load. Auto Scaling groups for the worker nodes would be crucial in this strategy.

Active-Active

For EKS, this would involve having multiple EKS clusters in different regions, all serving traffic simultaneously.

Example with EKS: You'd run EKS clusters in two or more regions, with AWS Global Accelerator or Route 53 routing policies (latency-based or weighted, for example) distributing traffic across all of them. Data replication and synchronization become crucial here, especially if you're running stateful applications. AWS services like DynamoDB Global Tables or cross-region RDS replication would be important components.
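To make the traffic-splitting part more concrete, here is a sketch that builds a Route 53 change batch with one weighted record per region. It assumes each region exposes its cluster behind a load balancer hostname. The function name and the example.com hostnames are hypothetical, and the resulting dictionary is meant to be passed as ChangeBatch to the Route 53 change_resource_record_sets call.

```python
from typing import Any, Dict, List, Tuple


def build_weighted_change_batch(
    record_name: str,
    endpoints: Dict[str, Tuple[str, int]],
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build a Route 53 change batch with one weighted CNAME per region.

    endpoints maps a region name to (load balancer hostname, weight).
    Route 53 weights must be between 0 and 255.
    """
    changes: List[Dict[str, Any]] = []
    for region, (target, weight) in sorted(endpoints.items()):
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {region} must be between 0 and 255")
        changes.append(
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    # SetIdentifier distinguishes records sharing a name.
                    "SetIdentifier": region,
                    "Weight": weight,
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        )
    return {"Comment": "Weighted active-active routing", "Changes": changes}


batch = build_weighted_change_batch(
    "app.example.com.",
    {
        "us-east-1": ("lb-east.example.com", 50),
        "eu-west-1": ("lb-west.example.com", 50),
    },
)
```

Equal weights split traffic roughly evenly. Setting one region's weight to 0 stops Route 53 from routing to it as long as another record has a non-zero weight, and attaching health checks to the records would let Route 53 do this automatically when a region becomes unhealthy.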

Why do I suggest Pilot Light or Warm Standby?

Both the pilot light and warm standby methods provide a balance between cost and recovery speed. They're not as expensive as running an active-active configuration, but they offer faster recovery than a simple backup and restore method. Depending on your specific RTO and RPO requirements, as well as your budget, these methods provide a middle ground that works for many businesses.

Your Decision

Your choice should factor in how fast you need to recover (RTO) and how much data you can afford to lose (RPO). With EKS, things like cluster state, application configurations, and data need to be considered. Also, AWS offers various tools and services, from backup solutions to traffic routing and scaling, which can aid in implementing your chosen strategy.
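As a rough illustration of that trade-off, the sketch below maps RTO, RPO and budget to one of the four strategies. The thresholds are purely illustrative assumptions of mine, not AWS guidance, so treat them as placeholders to replace with your own business requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    rto_minutes: float  # how fast you need to recover
    rpo_minutes: float  # how much data you can afford to lose
    can_fund_full_second_region: bool = False


def suggest_dr_strategy(req: Requirements) -> str:
    """Suggest a DR strategy from RTO, RPO and budget.

    All thresholds below are hypothetical and only show the shape
    of the decision.
    """
    if req.rto_minutes <= 5:
        # Near-zero downtime needs capacity already serving traffic.
        if req.can_fund_full_second_region:
            return "active-active"
        return "warm standby"
    if req.rto_minutes <= 60:
        return "warm standby"
    if req.rto_minutes <= 8 * 60 or req.rpo_minutes < 60:
        # A tight RPO needs continuously replicated data,
        # even when recovery itself can take hours.
        return "pilot light"
    return "backup and restore"


print(suggest_dr_strategy(Requirements(rto_minutes=240, rpo_minutes=30)))
```

With these placeholder thresholds, a business that can tolerate four hours of downtime but only thirty minutes of data loss lands on pilot light, which matches the middle ground described above.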