Designing Active-Active and Disaster Recovery Data Centers
Last modified on 2024-10-19 (release notes)
13:57 Free items Introduction |
||
In the first section of this webinar we'll try to figure out why we'd want to migrate application workload between data centers, and define a few useful terms like RTO, RPO, MTTR, and MTTI. |
||
Introduction and Definitions | 13:57 | 2017-03-29 |
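
As a rough illustration of how two of these terms relate to an incident timeline, the sketch below compares the data loss window and the downtime of a single incident against RPO and RTO targets. It assumes the common reading of RPO (how much recent data you can afford to lose, measured in time) and RTO (how quickly the service must be back). The timestamps, the targets, and the function name `recovery_metrics` are made up for the example.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_copy, failure, service_restored):
    """Return (data_loss_window, downtime) for a single incident."""
    data_loss_window = failure - last_good_copy  # compare against the RPO
    downtime = service_restored - failure        # compare against the RTO
    return data_loss_window, downtime


# Hypothetical incident timeline (not taken from the webinar)
last_good_copy = datetime(2017, 3, 29, 1, 0)     # last replicated or backed-up state
failure = datetime(2017, 3, 29, 1, 45)           # primary data center fails
service_restored = datetime(2017, 3, 29, 5, 15)  # application is serving users again

loss, downtime = recovery_metrics(last_good_copy, failure, service_restored)

rpo = timedelta(hours=1)  # assumed target
rto = timedelta(hours=4)  # assumed target

print(f"data loss window: {loss}, within RPO: {loss <= rpo}")
print(f"downtime: {downtime}, within RTO: {downtime <= rto}")
```

In this made-up incident, 45 minutes of data is lost and the service is down for three and a half hours, so both assumed targets are met.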
49:45 Free items Typical Challenges |
||
There are four typical reasons why you'd want to migrate application servers between data centers: disaster recovery, disaster avoidance, data center migration, and load balancing across data centers. |
||
Disaster Recovery | 6:50 | 2017-03-29 |
Disaster Avoidance | 29:46 | 2017-03-29 |
Data Center Migration | 2:47 | 2017-03-29 |
Load Balancing Across Data Centers and Cloudbursting | 10:22 | 2017-03-29 |
28:17 Free items Limitations and Considerations |
||
A number of factors limit our ability to deploy servers across multiple data centers: latency, bandwidth limitations, and data gravity. |
||
Latency | 8:09 | 2017-03-29 |
Limited Bandwidth | 10:55 | 2017-03-29 |
Storage Considerations | 9:13 | 2017-03-29 |
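
To get a feel for why latency limits what you can stretch across data centers, here is a back-of-the-envelope sketch. It relies on the fact that light travels roughly 200 km per millisecond in optical fiber, and it counts propagation delay only, ignoring queuing, serialization, device latency, and the fact that real fiber paths are longer than the straight-line distance. The distance and the number of sequential round trips are hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # light in fiber covers roughly 200 km per millisecond


def round_trip_ms(distance_km):
    """Propagation-only round-trip time between two sites."""
    return 2 * distance_km / FIBER_KM_PER_MS


def transaction_delay_ms(distance_km, sequential_round_trips):
    """Added delay when a transaction needs that many back-to-back round trips."""
    return round_trip_ms(distance_km) * sequential_round_trips


# Hypothetical case: application server and database 1000 km apart,
# with 500 sequential database queries per user transaction.
rtt = round_trip_ms(1000)
delay = transaction_delay_ms(1000, 500)

print(f"round-trip time: {rtt} ms")
print(f"added delay per transaction: {delay / 1000} s")
```

A 10 ms round trip looks harmless until a chatty application multiplies it by hundreds of sequential requests.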
16:12 Typical Solutions |
||
Well-designed active-active applications use "swimlanes" - a concept where multiple copies of an application stack reside in different locations. |
||
Parallel Application Stacks (Swimlanes) | 16:12 | 2017-03-29 |
Describing Fault Domains | ||
A great introduction to fault domains, fault levels, cascading failures, and fault hierarchy. |
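
The swimlane idea mentioned in this section could be sketched along the following lines. This is only one possible interpretation: it assumes each user is pinned to a "home" copy of the application stack by a stable hash and is moved to another copy only when the home swimlane is unhealthy. The swimlane names and both functions are hypothetical.

```python
import hashlib

SWIMLANES = ["dc-east", "dc-west"]  # hypothetical swimlane names


def home_swimlane(user_id, lanes=SWIMLANES):
    """Map a user to a swimlane with a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return lanes[int.from_bytes(digest[:8], "big") % len(lanes)]


def pick_swimlane(user_id, healthy):
    """Use the home swimlane when it is healthy, otherwise a healthy one."""
    candidates = [lane for lane in SWIMLANES if lane in healthy]
    if not candidates:
        raise RuntimeError("no healthy swimlane available")
    home = home_swimlane(user_id)
    if home in healthy:
        return home
    return home_swimlane(user_id, candidates)


print(pick_swimlane("alice", {"dc-east", "dc-west"}))
print(pick_swimlane("alice", {"dc-west"}))
```

Requests stay within a single swimlane in the normal case, so a failure in one location does not have to affect users served by the other one.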
||
31:34 Free items Long-Distance VM Mobility Challenges |
||
Instead of redesigning applications to make them work across multiple data centers, enterprise environments typically try to solve the challenges within the infrastructure, sometimes even moving running servers between data centers. This section describes the most obvious drawbacks of that idea. |
||
Inter-DC vMotion Bandwidth | 6:04 | 2017-05-03 |
Large Layer-2 Domains | 9:27 | 2017-05-03 |
Ingress and Egress Traffic Flows | 16:03 | 2017-05-03 |
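
The bandwidth drawback can be approximated with simple arithmetic. The sketch below is a simplified model of iterative pre-copy live migration, not a description of how vMotion is implemented: it assumes the VM keeps dirtying memory at a constant rate while its memory is copied, so the total copy time approaches memory size divided by the difference between link speed and dirty rate. The function name and all numbers are hypothetical.

```python
def migration_time_s(memory_gb, link_gbps, dirty_gbps=0.0):
    """Estimate the seconds needed to copy VM memory over an inter-DC link.

    Assumes pages are dirtied at a constant rate during the copy.
    Returns infinity when memory is dirtied at least as fast as it is copied.
    """
    if dirty_gbps >= link_gbps:
        return float("inf")
    return memory_gb * 8 / (link_gbps - dirty_gbps)


# Hypothetical 64 GB VM moved over a 1 Gbps inter-DC link
idle_vm = migration_time_s(64, 1)
busy_vm = migration_time_s(64, 1, dirty_gbps=0.5)
too_busy_vm = migration_time_s(64, 1, dirty_gbps=1)

print(f"idle VM: {idle_vm} s")
print(f"busy VM: {busy_vm} s")
print(f"VM dirtying memory at link speed: {too_busy_vm} s")
```

Under these assumptions, evacuating even a handful of large, busy VMs takes long enough that the link speed decides whether the idea is usable at all.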
42:37 Summary & Questions |
||
Time for a wrap-up. We'll discuss the right way of doing things, surviving infrastructure failures, and typical real-life designs. |
||
Surviving the Failures | 15:37 | 2017-05-03 |
The Right Way of Doing Things | 10:12 | 2017-05-03 |
Typical Real-Life Designs | 9:05 | 2017-05-03 |
Summary and Questions | 7:43 | 2017-05-03 |
1:27:00 Lessons Learned Operating Active-Active Data Centers |
||
Networking and virtualization vendors keep proposing crazier and crazier ideas that are supposed to allow you to run active-active data centers without touching the application architecture. Not surprisingly, most of them fail disastrously under the right failure conditions. If you want to have a highly available application, there's simply no substitute for good design, including global and local load balancing. In his presentation, Ethan Banks described the architecture he used when running multiple data centers for a large credit card payment processor, and the lessons he learned while operating them. |
||
Definitions and Typical Setup | 7:44 | 2016-10-09 |
Internet Edge, DNS, and BGP | 16:08 | 2016-10-09 |
Firewalls | 11:15 | 2016-10-09 |
Load Balancers | 14:07 | 2016-10-09 |
Core Network | 20:22 | 2016-10-09 |
High-Level Comments and Conclusions | 17:24 | 2016-10-09 |
Free items Slide Deck |
||
Designing Active-Active and Disaster Recovery Data Centers | 11M | 2015-11-07 |
Disaster Recovery Myths | 6.2M | 2024-10-19 |
36:07 From the ipSpace.net Design Clinic |
||
Migrating Application Stacks into Public Clouds | 16:36 | 2021-12-27 |
Running Applications in Multi-Cloud Environment | 19:31 | 2022-05-30 |
Additional Resources |
||
The blog posts, articles, and books collected in this section might help you get a broader perspective on high-availability application architectures. |
||
Application Design and Operations |
||
Scalability Rules: Principles for Scaling Web Sites (2nd Edition) | ||
A must-read book for anyone interested in robust high-availability application design. |
||
Systems Design for Advanced Beginners | ||
Site Reliability Engineering: How Google Runs Production Systems | ||
More Site Reliability Engineering (SRE) resources | ||
The Four Horsemen of Network Communication | ||
Disaster Recovery in AWS |
||
High availability concepts don't change just because you're deploying your workloads in a public cloud. If anything, public clouds require cleaner architectures as they don't support enterprise kludges like layer-2 DCI. It's therefore worth reading the series of articles describing disaster recovery solutions within AWS. |
||
Strategies | ||
Architecture and Patterns | ||
Backup and Restore | ||
Pilot Light and Warm Standby | ||
Multi-site Active/Active | ||
Implementing Multi-Region Disaster Recovery Using Event-Driven Architecture | ||
Disaster Recovery with AWS Services |
||
AWS published several blog posts describing how you could use AWS services in a disaster recovery process. These documents are obviously self-serving, but you might find them valuable should you decide to deploy your workload on AWS, or you could use the same concepts when implementing disaster recovery in a different environment. |
||
Disaster Recovery with AWS Managed Services (Single Region) | ||
Multi-Region Backup and Restore | ||
AWS Multi-Region Application Architecture with AWS Services |
||
Part 1: Compute, Networking, and Security | ||
Part 2: Data and Replication | ||
Part 3: Application Management and Monitoring | ||
Minimizing Dependencies in a Disaster Recovery Plan | ||
Load Balancing and Service Discovery |
||
Load balancing in Google network | ||
Building a billion user load balancer (Facebook) | ||
Ananta: Cloud Scale Load Balancing (Microsoft Azure) | ||
GitHub Load Balancer | ||
A quick intro to Consul | ||
DNS-based Load Balancing with NSONE (podcast) | ||
Redundancy and Resiliency |
||
Redundant network designs usually use 1+1 redundancy. Applications (at least the database layer) are usually no better. However, 1+1 redundancy might not be good enough, and too much redundancy might decrease the overall availability. |
||
1+1 Redundancy Just Isn’t Good Enough | ||
Gray failures: the Achilles’ heel of cloud-scale systems | ||
Why Shared Mutable State Is the Root of All Evil | ||
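
The availability arithmetic behind these claims can be sketched with the textbook formulas for components in parallel and in series. The sketch assumes that failures are independent, which real systems often violate, and all availability figures are made up for the example.

```python
def parallel(*availabilities):
    """Availability of redundant components: down only when all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def series(*availabilities):
    """Availability of a chain of components: up only when all are up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


single = 0.99                    # hypothetical availability of one node
pair = parallel(single, single)  # 1+1 redundancy
triple = parallel(single, single, single)

# Redundancy usually needs a failover or coordination mechanism,
# and that mechanism sits in series with the redundant components.
pair_with_failover = series(pair, 0.999)
pair_with_bad_failover = series(pair, 0.98)

print(f"single node:            {single:.6f}")
print(f"1+1 pair:               {pair:.6f}")
print(f"three nodes:            {triple:.6f}")
print(f"pair + 0.999 failover:  {pair_with_failover:.6f}")
print(f"pair + 0.98 failover:   {pair_with_bad_failover:.6f}")
```

With these numbers, a redundant pair that depends on an unreliable failover mechanism ends up less available than a single node, which is one way added redundancy can reduce overall availability.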
Testing Resilient Application Stacks |
||
Resilience Engineering: Learning to Embrace Failure | ||
The Netflix Simian Army | ||
Simian Army source code on GitHub | ||
Testing in Production: Yes, You Can | ||
AWS Fault Injection Simulator | ||
Toxiproxy: a Framework for Simulating Network Conditions |