Designing Active-Active and Disaster Recovery Data Centers
13:57 Introduction |
||
In the first section of this webinar we'll try to figure out why we'd want to migrate application workload between data centers, and define a few useful terms like RTO, RPO, MTTR and MTTI. |
||
Introduction and Definitions | 13:57 | 2017-03-28 |
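As a rough illustration of two of the terms above, under their common definitions: the RPO (recovery point objective) bounds how much recent data may be lost, and the RTO (recovery time objective) bounds how long the service may stay down. The sketch below is not taken from the webinar; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated: datetime, failure: datetime) -> timedelta:
    """Time span of writes that never reached the recovery site."""
    return failure - last_replicated


def outage_duration(failure: datetime, restored: datetime) -> timedelta:
    """Time the service was unavailable."""
    return restored - failure


def meets_objectives(last_replicated: datetime, failure: datetime,
                     restored: datetime, rpo: timedelta, rto: timedelta) -> bool:
    """True if the incident stayed within both the RPO and the RTO."""
    return (data_loss_window(last_replicated, failure) <= rpo
            and outage_duration(failure, restored) <= rto)


# Hypothetical incident: last replication 4 minutes before the failure,
# service restored 45 minutes after it.
failure = datetime(2017, 3, 28, 12, 0, 0)
last_replicated = failure - timedelta(minutes=4)
restored = failure + timedelta(minutes=45)

ok = meets_objectives(last_replicated, failure, restored,
                      rpo=timedelta(minutes=5), rto=timedelta(hours=1))
```

With a 5-minute RPO and a 1-hour RTO this hypothetical incident stays within both objectives; tightening the RTO to 30 minutes would make it a miss.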
49:45 Free items Typical Challenges |
||
There are four typical reasons why you'd want to move application workloads between data centers: data center migration, disaster recovery, disaster avoidance, and load balancing across data centers. |
||
Disaster Recovery | 6:50 | 2017-03-28 |
Disaster Avoidance | 29:46 | 2019-10-12 |
Data Center Migration | 2:47 | 2017-03-28 |
Load Balancing Across Data Centers and Cloudbursting | 10:22 | 2017-03-28 |
28:17 Limitations and Considerations |
||
A number of factors limit our ability to deploy servers across multiple data centers: latency, bandwidth limitations, and data gravity. |
||
Latency | 8:09 | 2017-03-28 |
Limited Bandwidth | 10:55 | 2017-03-28 |
Storage considerations | 9:13 | 2017-03-28 |
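To give a feel for why latency limits stretching an application across data centers, here is a back-of-the-envelope sketch. It assumes light travels roughly 200 km per millisecond in fiber (about two-thirds of its speed in vacuum) and ignores equipment delays and indirect fiber paths, so real numbers will be worse. The function names are made up for this example.

```python
# Assumption: signal propagation in optical fiber is roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Propagation-only round-trip time between two sites."""
    return 2 * distance_km / FIBER_KM_PER_MS


def added_latency_ms(distance_km: float, sequential_round_trips: int) -> float:
    """Extra delay for a transaction that makes a number of
    back-to-back requests across the inter-DC link."""
    return sequential_round_trips * round_trip_ms(distance_km)


# Hypothetical case: application server and database 100 km apart,
# 500 sequential database queries per user transaction.
rtt = round_trip_ms(100)
extra = added_latency_ms(100, 500)
```

Under these assumptions each round trip costs about 1 ms, which looks harmless until a chatty application multiplies it by hundreds of sequential requests.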
16:12 Typical Solutions |
||
Well-designed active-active applications use "swimlanes" - a concept where multiple copies of an application stack reside in different locations. |
||
Parallel Application Stacks (Swimlanes) | 16:12 | 2017-03-28 |
Describing Fault Domains | ||
A great introduction to fault domains, fault levels, cascading failures, and fault hierarchy. |
||
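One way to picture the swimlane idea is a front end that pins each user to one complete application stack and moves the user to another stack only when the home stack is unavailable. The sketch below is an assumption made for illustration, not the design presented in the webinar; the lane names and functions are hypothetical.

```python
import hashlib

# Hypothetical swimlanes: each name stands for a full application stack in one location.
LANES = ["dc-east", "dc-west"]


def home_lane(user_id: str, lanes: list[str]) -> str:
    """Deterministically map a user to one lane using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return lanes[int.from_bytes(digest[:8], "big") % len(lanes)]


def pick_lane(user_id: str, lanes: list[str], healthy: set[str]) -> str:
    """Use the home lane if it is healthy, otherwise fall back to a healthy one."""
    candidates = [lane for lane in lanes if lane in healthy]
    if not candidates:
        raise RuntimeError("no healthy swimlane available")
    lane = home_lane(user_id, lanes)
    return lane if lane in healthy else home_lane(user_id, candidates)


normal = pick_lane("alice", LANES, healthy={"dc-east", "dc-west"})
degraded = pick_lane("alice", LANES, healthy={"dc-east"})
```

Because a request never crosses from one lane into another, a failure in one stack stays within its own fault domain instead of cascading into the other location.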
31:34 Free items Long-Distance VM Mobility Challenges |
||
Instead of redesigning applications to make them work across multiple data centers, enterprise environments typically try to solve the challenges within the infrastructure, sometimes even moving running servers between data centers. This section describes the most obvious drawbacks of that idea. |
||
Inter-DC vMotion Bandwidth | 6:04 | 2017-05-02 |
Large Layer-2 Domains | 9:27 | 2017-05-02 |
Ingress and Egress Traffic Flows | 16:03 | 2017-05-02 |
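The bandwidth concern with moving running servers can be estimated with a simplified pre-copy model: memory is copied while the VM keeps running, pages dirtied in the meantime are copied again, and the process repeats until the remainder is small enough for a final stop-and-copy. This is a generic sketch under stated assumptions (constant dirty rate, no compression, no protocol overhead), not a description of how any particular hypervisor implements migration; all names and numbers are hypothetical.

```python
def precopy_transfer_seconds(memory_gb: float, link_gbps: float,
                             dirty_gbps: float, max_rounds: int = 10,
                             stop_gb: float = 0.1) -> float:
    """Estimate total migration time for an iterative pre-copy transfer.

    memory_gb  -- VM memory to move, in gigabytes
    link_gbps  -- usable inter-DC bandwidth, in gigabits per second
    dirty_gbps -- rate at which the VM dirties memory, in gigabits per second
    """
    if dirty_gbps >= link_gbps:
        raise ValueError("memory is dirtied faster than the link can copy it")
    remaining_gb = memory_gb
    total_seconds = 0.0
    for _ in range(max_rounds):
        round_seconds = remaining_gb * 8 / link_gbps   # copy what is outstanding
        total_seconds += round_seconds
        remaining_gb = dirty_gbps * round_seconds / 8  # dirtied during this round
        if remaining_gb <= stop_gb:
            break
    # Final stop-and-copy of whatever is still dirty.
    total_seconds += remaining_gb * 8 / link_gbps
    return total_seconds


# Hypothetical 16 GB VM over a 10 Gbps link.
idle_vm = precopy_transfer_seconds(16, 10, dirty_gbps=0)
busy_vm = precopy_transfer_seconds(16, 10, dirty_gbps=4)
```

In this model an idle 16 GB VM needs about 12.8 seconds on a dedicated 10 Gbps link, a busy VM needs longer, and a VM that dirties memory faster than the link can carry it never converges.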
42:37 Summary & Questions |
||
Time for a wrap-up. We'll discuss the right way of doing things, surviving infrastructure failures, and typical real-life designs. |
||
Surviving the Failures | 15:37 | 2017-05-02 |
The Right Way of Doing Things | 10:12 | 2017-05-02 |
Typical Real-Life Designs | 9:05 | 2017-05-02 |
Summary and Questions | 7:43 | 2017-05-02 |
Slide Deck |
||
Designing Active-Active and Disaster Recovery Data Centers | 11M | 2015-11-07 |
Additional Resources |
||
The blog posts, articles, and books collected in this section might help you get a broader perspective on high-availability application architectures. |
||
Application Design and Operations |
||
Scalability Rules: Principles for Scaling Web Sites (2nd Edition) | ||
A must-read book for anyone interested in robust high-availability application design. |
||
Systems Design for Advanced Beginners | ||
Site Reliability Engineering: How Google Runs Production Systems | ||
More Site Reliability Engineering (SRE) resources | ||
Load Balancing and Service Discovery |
||
Load balancing in Google network | ||
Building a billion user load balancer (Facebook) | ||
Ananta: Cloud Scale Load Balancing (Microsoft Azure) | ||
GitHub Load Balancer | ||
A quick intro to Consul | ||
DNS-based Load Balancing with NSONE (podcast) | ||
Redundancy and Resiliency |
||
Redundant network designs usually use 1+1 redundancy. Applications (at least the database layer) are usually no better. However, 1+1 redundancy might not be good enough, and too much redundancy might decrease the overall availability. |
||
1+1 Redundancy Just Isn’t Good Enough | ||
Gray failures: the Achilles’ heel of cloud-scale systems | ||
Why Shared Mutable State Is the Root of All Evil | ||
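The arithmetic behind the redundancy trade-off can be sketched with the textbook availability formulas: redundant components in parallel fail only when all of them fail, while components in series fail when any one fails. The model below assumes independent failures, which gray failures and shared state routinely violate, and it represents the failover mechanism as a single series element. Both are simplifying assumptions made for this example.

```python
def parallel_availability(component: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures."""
    return 1 - (1 - component) ** copies


def with_failover(component: float, copies: int, failover: float) -> float:
    """Same, but the shared failover mechanism sits in series with the copies."""
    return failover * parallel_availability(component, copies)


# Hypothetical numbers: each copy is 99% available.
single = parallel_availability(0.99, 1)
ideal_pair = parallel_availability(0.99, 2)          # ideal 1+1 redundancy
real_pair = with_failover(0.99, 2, failover=0.999)   # imperfect failover
```

With these numbers the ideal pair reaches roughly 99.99%, but a 99.9% reliable failover mechanism pulls the result below that figure: the added machinery becomes the limiting factor.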
Testing Resilient Application Stacks |
||
Resilience Engineering: Learning to Embrace Failure | ||
The Netflix Simian Army | ||
Simian Army source code on GitHub | ||
Testing in Production: Yes, You Can |