Designing Active-Active and Disaster Recovery Data Centers

This webinar covers typical design scenarios encountered when building a disaster recovery data center or deploying multiple data centers in an active-active configuration.

13:57 Introduction

In the first section of this webinar we'll try to figure out why we'd want to migrate application workloads between data centers, and define a few useful terms like RTO, RPO, MTTR and MTTI.

Introduction and Definitions 13:57 2017-03-28

49:45 Free items Typical Challenges

There are four typical reasons why you'd want to move application servers between data centers: data center migration, disaster recovery, disaster avoidance, and load balancing across data centers.

Disaster Recovery 6:50 2017-03-28
Disaster Avoidance 29:46 2019-10-12
Data Center Migration 2:47 2017-03-28
Load Balancing Across Data Centers and Cloudbursting 10:22 2017-03-28

28:17 Limitations and Considerations

A number of factors limit our ability to deploy servers across multiple data centers: latency, bandwidth limitations, and data gravity.

Latency 8:09 2017-03-28
Limited Bandwidth 10:55 2017-03-28
Storage considerations 9:13 2017-03-28
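
To get a feel for the latency limitation, here is a minimal back-of-the-envelope sketch. It assumes light travels through optical fiber at roughly 200 km per millisecond (about two thirds of the speed of light in vacuum) and ignores queuing, serialization, and device delays, so the results are a lower bound. The function names and the example numbers are illustrative assumptions, not taken from the webinar.

```python
# Rule-of-thumb propagation speed in optical fiber: about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay over a fiber path of the given length."""
    return 2 * distance_km / FIBER_KM_PER_MS


def transaction_delay_ms(distance_km: float, round_trips: int) -> float:
    """Delay added to a transaction that needs several sequential round trips
    (for example, an application server talking to a database in the other site)."""
    return round_trips * round_trip_ms(distance_km)


if __name__ == "__main__":
    # Hypothetical example: two sites 1000 km apart, 50 sequential database queries.
    print(round_trip_ms(1000))             # propagation delay of one round trip, in ms
    print(transaction_delay_ms(1000, 50))  # delay added to the whole transaction, in ms
```

The point of the sketch is that the per-round-trip delay looks harmless, but a chatty application multiplies it by the number of sequential exchanges.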

16:12 Typical Solutions

Well-designed active-active applications use "swimlanes" - a concept where multiple copies of an application stack reside in different locations.

Parallel Application Stacks (Swimlanes) 16:12 2017-03-28
Describing Fault Domains

A great introduction to fault domains, fault levels, cascading failures, and fault hierarchy.

31:34 Free items Long-Distance VM Mobility Challenges

Instead of redesigning applications to make them work across multiple data centers, enterprise environments typically try to solve the challenges within the infrastructure, sometimes even moving running servers between data centers. This section describes the most obvious drawbacks of that idea.

Inter-DC vMotion Bandwidth 6:04 2017-05-02
Large Layer-2 Domains 9:27 2017-05-02
Ingress and Egress Traffic Flows 16:03 2017-05-02
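
The bandwidth drawback of moving running servers can be estimated with simple arithmetic. The sketch below is a simplified model under stated assumptions, not a description of how any particular hypervisor implements live migration: the first function gives the time to copy the VM memory once, and the second assumes pages are dirtied at a constant rate while the copy is in progress, so the total copy time grows as a geometric series. All names and numbers are hypothetical.

```python
def min_copy_seconds(memory_gb: float, link_gbps: float) -> float:
    """Lower bound: time to push the VM memory once over the link.

    memory_gb is in gigabytes, link_gbps in gigabits per second.
    """
    return memory_gb * 8 / link_gbps


def precopy_seconds(memory_gb: float, link_gbps: float, dirty_gbps: float) -> float:
    """Estimated copy time when memory is dirtied at a constant rate during the copy.

    Each pass has to resend what was dirtied during the previous pass, which sums
    to a geometric series: base_time / (1 - dirty_rate / link_rate).
    """
    if dirty_gbps >= link_gbps:
        raise ValueError("memory is dirtied faster than the link can copy it")
    return min_copy_seconds(memory_gb, link_gbps) / (1 - dirty_gbps / link_gbps)


if __name__ == "__main__":
    # Hypothetical 64 GB VM moved over a 1 Gbps and a 10 Gbps inter-DC link.
    print(min_copy_seconds(64, 1))
    print(min_copy_seconds(64, 10))
    # Same VM on the 10 Gbps link while dirtying memory at 5 Gbps.
    print(precopy_seconds(64, 10, 5))
```

Multiply the single-VM estimate by the number of servers you'd have to evacuate to see how much inter-DC bandwidth a mass move would need.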

42:37 Summary & Questions

Time for a wrap-up. We'll discuss the right way of doing things, surviving infrastructure failures, and typical real-life designs.

Surviving the Failures 15:37 2017-05-02
The Right Way of Doing Things 10:12 2017-05-02
Typical Real-Life Designs 9:05 2017-05-02
Summary and Questions 7:43 2017-05-02

Slide Deck

Designing Active-Active and Disaster Recovery Data Centers 11M 2015-11-07

Additional Resources

The blog posts, articles, and books collected in this section might help you get a broader perspective on high-availability application architectures.

Application Design and Operations

Scalability Rules: Principles for Scaling Web Sites (2nd Edition)

A must-read book for anyone interested in robust high-availability application design.

Systems Design for Advanced Beginners
Site Reliability Engineering: How Google Runs Production Systems
More Site Reliability Engineering (SRE) resources

High Availability Architectures

Disaster Recovery in AWS: Strategies
Disaster Recovery in AWS: Architecture and Patterns
Disaster Recovery in AWS: Backup and Restore

Load Balancing and Service Discovery

Load balancing in Google network
Building a billion user load balancer (Facebook)
Ananta: Cloud Scale Load Balancing (Microsoft Azure)
GitHub Load Balancer
A quick intro to Consul
DNS-based Load Balancing with NSONE (podcast)

Redundancy and Resiliency

Redundant network designs usually use 1+1 redundancy. Applications (at least the database layer) are usually no better. However, 1+1 redundancy might not be good enough, and too much redundancy might decrease the overall availability.

1+1 Redundancy Just Isn’t Good Enough
Gray failures: the Achilles’ heel of cloud-scale systems
Why Shared Mutable State Is the Root of All Evil
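
A small worked example might help with the redundancy claims above. The sketch uses the textbook formula for the availability of n independent parallel components, and a simple "coverage" model for 1+1 redundancy in which the failover to the standby only succeeds with a given probability. Both are assumptions made for illustration (real failures are rarely independent), and the function names are hypothetical.

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant components, assuming independent failures
    and a perfect failover mechanism: the system is down only if all n fail."""
    return 1 - (1 - a) ** n


def one_plus_one_with_coverage(a: float, coverage: float) -> float:
    """Availability of a 1+1 pair when failover is imperfect.

    The system is up if the primary is up, or if the primary is down,
    the failover succeeds (probability `coverage`), and the standby is up.
    """
    return a + (1 - a) * coverage * a


if __name__ == "__main__":
    a = 0.99  # hypothetical availability of a single component
    print(parallel_availability(a, 2))         # ideal 1+1
    print(parallel_availability(a, 3))         # ideal 2+1 style redundancy
    print(one_plus_one_with_coverage(a, 0.9))  # 1+1 with failover that works 90% of the time
```

With these numbers, imperfect failover costs the 1+1 pair most of the availability gain it would have under ideal assumptions, which is one way to read the warning that redundancy mechanisms can themselves become the weak point.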

Testing Resilient Application Stacks

Resilience Engineering: Learning to Embrace Failure
The Netflix Simian Army
Simian Army source code on GitHub
Testing in Production: Yes, You Can