Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing. Use when implementing backup systems, configuring point-in-time recovery, setting up multi-region failover, or validating DR procedures.
View on GitHub: ancoleman/ai-design-components
backend-ai-skills
February 1, 2026
npx add-skill https://github.com/ancoleman/ai-design-components/blob/main/skills/planning-disaster-recovery/SKILL.md -a claude-code --skill planning-disaster-recovery
Installation paths:
.claude/skills/planning-disaster-recovery/

# Disaster Recovery

## Purpose

Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

## When to Use This Skill

Invoke this skill when:

- Defining recovery time objectives (RTO) and recovery point objectives (RPO)
- Implementing database backups with point-in-time recovery (PITR)
- Setting up Kubernetes cluster backup and restore workflows
- Configuring cross-region replication for high availability
- Testing disaster recovery procedures through chaos experiments
- Meeting compliance requirements (GDPR, SOC 2, HIPAA)
- Automating backup monitoring and alerting
- Designing multi-cloud disaster recovery architectures

## Core Concepts

### RTO and RPO Fundamentals

**Recovery Time Objective (RTO):** Maximum acceptable downtime after a disaster before business impact becomes unacceptable.

**Recovery Point Objective (RPO):** Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.

**Criticality Tiers:**

- **Tier 0 (Mission-Critical):** RTO < 1 hour, RPO < 5 minutes
- **Tier 1 (Production):** RTO 1-4 hours, RPO 15-60 minutes
- **Tier 2 (Important):** RTO 4-24 hours, RPO 1-6 hours
- **Tier 3 (Standard):** RTO > 24 hours, RPO > 6 hours

### 3-2-1 Backup Rule

Maintain **3 copies** of data on **2 different media** types with **1 copy offsite**.

Example implementation:

- Primary: Production database
- Secondary: Local backup storage
- Tertiary: Cloud backup (S3/GCS/Azure)

### Backup Types

**Full Backup:** Complete copy of all data. Slowest to create, fastest to restore.

**Incremental Backup:** Only changes since last backup. Fastest to create, requires full + all incrementals to restore.
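
The restore dependency for incremental backups can be illustrated with a short Python sketch. It is not part of the original skill: the `Backup` record and the `restore_chain` function are hypothetical names, and it assumes each backup is tagged only with a kind and a timestamp. Real backup tools track this chain in their own catalogs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    """Hypothetical backup record, for illustration only."""
    kind: str            # "full" or "incremental"
    taken_at: datetime


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, oldest first, to reach the latest
    state at or before `target`: the most recent full backup plus every
    incremental taken after it."""
    # Only backups taken at or before the target time are usable.
    usable = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    # Locate the most recent full backup; the chain must start there.
    last_full = None
    for i, b in enumerate(usable):
        if b.kind == "full":
            last_full = i
    if last_full is None:
        raise ValueError("no full backup at or before the target time")
    # Everything after the last full backup is incremental by construction.
    return usable[last_full:]
```

Under these assumptions, a chain with a full backup on day 1 and incrementals on days 2 and 3 needs all three to restore to day 3. Losing any incremental in the chain makes every later restore point unreachable, which is the trade-off for fast backup creation.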
**Differential Backup:** Change