AWS defines four disaster recovery (DR) strategies, ordered from lowest cost/highest RTO to highest cost/lowest RTO:
(1) Backup & Restore (RTO: hours, RPO: hours): only data is backed up to Amazon S3 or S3 Glacier, and no environment is kept running. When disaster strikes, you rebuild the infrastructure and restore the data from backup. This is the cheapest option but has the longest downtime, so it suits non-critical workloads.
(2) Pilot Light (RTO: 10-30 minutes, RPO: minutes): only the core components stay live in the DR region (a replicated database, ready-to-launch AMIs), while compute is scaled to zero. On failure, launch compute from the AMIs and scale up the database. This is faster than Backup & Restore but still incurs provisioning time.
(3) Warm Standby (RTO: minutes, RPO: seconds): a scaled-down copy of the production environment is always running in the DR region (a minimum-size Auto Scaling group (ASG) and a database replica). On failure, scale up the ASG and promote the database replica. Failover is fast because the standby can take traffic at reduced capacity while it scales.
(4) Multi-Site Active-Active (RTO: near-zero, RPO: near-zero): full production capacity runs in multiple regions simultaneously, and Route 53 or AWS Global Accelerator distributes traffic across them. This is the most expensive and most complex option, but it offers near-zero downtime.
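The two Warm Standby failover steps in item (3), scaling up the ASG and promoting the database replica, could be sketched roughly as follows. This is a minimal illustration, not a production runbook: the function name `fail_over_warm_standby` and its parameters are hypothetical, the clients are passed in so the logic can be exercised with stubs, and it assumes the standby database is an RDS (non-Aurora) read replica.

```python
from typing import Any


def fail_over_warm_standby(
    autoscaling: Any,   # assumed: a boto3 "autoscaling" client for the DR region
    rds: Any,           # assumed: a boto3 "rds" client for the DR region
    asg_name: str,      # hypothetical name of the standby Auto Scaling group
    replica_id: str,    # hypothetical identifier of the RDS read replica
    desired: int,       # instance count needed to carry production load
    max_size: int,      # upper bound the ASG may scale to
) -> None:
    """Scale the standby ASG to production size, then promote the replica."""
    if desired > max_size:
        raise ValueError("desired capacity cannot exceed max_size")

    # Step 1: grow the scaled-down ASG. Started first because launching
    # instances takes the longest.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=desired,
        MaxSize=max_size,
        DesiredCapacity=desired,
    )

    # Step 2: promote the read replica to a standalone, writable instance.
    # Replication from the old primary stops at this point.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
```

In real use the clients would presumably come from `boto3.client("autoscaling", region_name=...)` and `boto3.client("rds", region_name=...)` pointed at the DR region. Redirecting traffic (for example through Route 53) is a separate step that this sketch leaves out.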
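RTO (Recovery Time Objective) = the maximum acceptable time between a disaster and service restoration; RPO (Recovery Point Objective) = the maximum acceptable amount of data loss, measured as time since the last recovery point (it therefore dictates how often backups or replication must run). Regular DR testing is as important as the implementation itself: use Game Day exercises and chaos engineering.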
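To make the two definitions concrete, here is a small sketch of how a Game Day drill could be scored against its objectives. The names `DrillResult` and `evaluate_drill`, and the timestamps in the example, are illustrative assumptions and not part of any AWS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    achieved_rpo: timedelta  # data-loss window: outage start minus last recovery point
    achieved_rto: timedelta  # downtime: service restored minus outage start
    rpo_met: bool
    rto_met: bool


def evaluate_drill(
    last_recovery_point: datetime,
    outage_start: datetime,
    service_restored: datetime,
    rpo_target: timedelta,
    rto_target: timedelta,
) -> DrillResult:
    """Compare what a drill actually achieved with the stated RPO and RTO."""
    achieved_rpo = outage_start - last_recovery_point
    achieved_rto = service_restored - outage_start
    return DrillResult(
        achieved_rpo=achieved_rpo,
        achieved_rto=achieved_rto,
        rpo_met=achieved_rpo <= rpo_target,
        rto_met=achieved_rto <= rto_target,
    )


# Example: nightly backup at 02:00, outage at 09:30, service back at 13:00.
result = evaluate_drill(
    last_recovery_point=datetime(2024, 1, 1, 2, 0),
    outage_start=datetime(2024, 1, 1, 9, 30),
    service_restored=datetime(2024, 1, 1, 13, 0),
    rpo_target=timedelta(hours=24),
    rto_target=timedelta(hours=4),
)
print(result)
```

In this example the drill loses 7.5 hours of data and takes 3.5 hours to recover, so it would meet a Backup & Restore style target of 24 hours RPO and 4 hours RTO but fall far short of Pilot Light or Warm Standby targets.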