RDS Backups, Snapshots, and Read Replicas
Carlos's production RDS database crashed on a Tuesday afternoon. The cause: a Friday deployment had silently dropped an index, and Monday's traffic finally pushed the database past its CPU limit. The team needed to roll back to Friday's state. The question was how — and the answer depends on what kind of backup they had.
This chapter covers the three RDS features that handle backup, recovery, and read scaling: automated backups, manual snapshots, and read replicas. They look similar at first glance and do very different things.
Automated Backups
When you create an RDS instance, automated backups are on by default. Two things happen:
- Every day during a chosen backup window, RDS takes a snapshot of the entire database;
- Every 5 minutes, RDS uploads transaction logs to S3.
Together, these enable point-in-time recovery (PITR): you can restore the database to any second within the retention window, up to the latest restorable time (typically within the last 5 minutes). Carlos's team needed PITR to the moment just before Friday's deployment.
Key facts:
- Retention is configurable from 1 to 35 days;
- Backups are stored in S3 by AWS, in the same region (not visible to you directly);
- Setting retention to 0 disables automated backups entirely — almost never the right call;
- Restoring creates a new RDS instance. You cannot overwrite the existing one;
- Automated backups are deleted with the source instance by default. To keep a restore point, take a final snapshot or choose to retain the automated backups when you delete the instance.
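The restore rules above can be sketched as a small request builder. This is a minimal sketch, not a definitive implementation: the instance names, dates, and the helper names `build_pitr_request` and `restore` are hypothetical, and the window check is only an approximation (the real upper bound is the instance's `LatestRestorableTime`). The boto3 call is kept in a separate function so the checks run without AWS access.

```python
from datetime import datetime, timedelta, timezone


def build_pitr_request(source_id, target_id, restore_time, now, retention_days):
    """Build keyword arguments for a point-in-time restore into a NEW instance."""
    if not 1 <= retention_days <= 35:
        raise ValueError("retention must be 1-35 days for PITR to be available")
    if restore_time.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes (UTC recommended)")
    # Approximate window: from (now - retention) up to now.
    earliest = now - timedelta(days=retention_days)
    if not earliest <= restore_time <= now:
        raise ValueError("restore time is outside the retention window")
    if target_id == source_id:
        raise ValueError("a restore cannot overwrite the source; pick a new identifier")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


def restore(params, region="us-east-1"):
    """Submit the restore. Needs boto3 and AWS credentials; not called in this sketch."""
    import boto3

    rds = boto3.client("rds", region_name=region)
    return rds.restore_db_instance_to_point_in_time(**params)


# Hypothetical dates: a Tuesday afternoon, restoring to the previous Friday.
now = datetime(2024, 6, 11, 15, 0, tzinfo=timezone.utc)
friday = datetime(2024, 6, 7, 14, 47, tzinfo=timezone.utc)
params = build_pitr_request("prod-db", "prod-db-restored", friday, now, retention_days=14)
print(params["TargetDBInstanceIdentifier"])  # prod-db-restored
```

Validating locally before calling the API is a design choice for the sketch; RDS itself rejects restore times outside the restorable window.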
Manual Snapshots
A manual snapshot is a point-in-time copy you trigger explicitly via the Console, CLI, or SDK. It captures the full state of the database and lives until you delete it — not bound by the automated retention window.
Use manual snapshots for:
- Pre-deployment safety nets ("snapshot before the migration");
- Long-term compliance ("keep the year-end state for 7 years");
- Sharing data with another AWS account;
- Copying between regions for disaster recovery.
Manual snapshots survive instance deletion. They can even outlive the AWS account, if you copy them to another account first.
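A pre-deployment snapshot is easier to find later if its name records the instance, the reason, and the time. The sketch below builds the arguments for boto3's `create_db_snapshot`. The naming scheme and the helper names are assumptions for illustration, not an AWS convention. The regular expression encodes the identifier rules as I understand them: start with a letter, use only letters, digits, and hyphens, with no trailing hyphen and no two hyphens in a row.

```python
import re
from datetime import datetime, timezone

# Letter first, then letters/digits, each optionally preceded by a single hyphen.
_VALID_ID = re.compile(r"^[A-Za-z](?:-?[A-Za-z0-9])*$")


def build_snapshot_request(instance_id, reason, taken_at):
    """Build keyword arguments for a manual snapshot with a descriptive name."""
    stamp = taken_at.strftime("%Y%m%d-%H%M")
    snapshot_id = f"{instance_id}-{reason}-{stamp}"
    if len(snapshot_id) > 255 or not _VALID_ID.match(snapshot_id):
        raise ValueError(f"invalid snapshot identifier: {snapshot_id!r}")
    return {
        "DBInstanceIdentifier": instance_id,
        "DBSnapshotIdentifier": snapshot_id,
    }


def take_snapshot(request, region="us-east-1"):
    """Submit the snapshot. Needs boto3 and AWS credentials; not called in this sketch."""
    import boto3

    rds = boto3.client("rds", region_name=region)
    return rds.create_db_snapshot(**request)


taken_at = datetime(2024, 6, 7, 14, 30, tzinfo=timezone.utc)  # hypothetical time
request = build_snapshot_request("prod-db", "pre-migration", taken_at)
print(request["DBSnapshotIdentifier"])  # prod-db-pre-migration-20240607-1430
```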
Snapshots vs Backups: The Practical Difference
The split, in one sentence: automated backups give you point-in-time recovery within the retention window; manual snapshots give you discrete restore points you control forever.
Most production setups use both:
- Automated backups with 35-day retention for tactical recoveries;
- Manual snapshots before any risky operation, kept indefinitely.
Both create new instances when restored. Neither modifies the live database.
Read Replicas
A read replica is a separate RDS instance that asynchronously copies data from a primary database. Reads can be directed to the replica, scaling read capacity without touching the primary's load.
Use cases:
- Offloading expensive reporting queries from the primary;
- Geographic distribution: a read replica in another region serves local users with low latency;
- Read-heavy workloads where the primary is the bottleneck.
Key facts:
- Up to 15 read replicas per source database for most engines;
- Replication is asynchronous: replicas are typically seconds behind, sometimes more under load;
- You can promote a read replica to a standalone database, which breaks the replication link;
- Cross-region read replicas are common for disaster recovery and global reads.
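Because replication is asynchronous, the application has to decide which queries can tolerate slightly stale data. A minimal sketch of that routing decision follows. The class name, the `needs_fresh_read` flag, and the hostnames are all hypothetical, and a real application would usually do this in its database driver or connection pool.

```python
from itertools import cycle


class ReadWriteRouter:
    """Send writes and freshness-sensitive reads to the primary, other reads to replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over the replicas; None means no replica is configured.
        self._replicas = cycle(replicas) if replicas else None

    def endpoint_for(self, is_write, needs_fresh_read=False):
        # Replicas may lag, so anything that must see the latest write uses the primary.
        if is_write or needs_fresh_read or self._replicas is None:
            return self.primary
        return next(self._replicas)


router = ReadWriteRouter(
    primary="prod-db.example.internal",             # hypothetical endpoint
    replicas=[
        "prod-db-replica-1.example.internal",       # hypothetical endpoint
        "prod-db-replica-2.example.internal",       # hypothetical endpoint
    ],
)
print(router.endpoint_for(is_write=True))    # prod-db.example.internal
print(router.endpoint_for(is_write=False))   # prod-db-replica-1.example.internal
```

A read issued right after a write (for example, reloading a record the user just saved) is the typical case for `needs_fresh_read=True`.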
Read Replica vs Multi-AZ
These two get confused on every exam. They are different:
- Multi-AZ: synchronous standby in another AZ, automatic failover, not readable while standby. For high availability;
- Read replica: asynchronous copy, manual or DNS-based failover, readable. For scaling reads and disaster recovery.
A production setup often uses both: Multi-AZ for resilience inside the region, plus one or more read replicas for scaling.
What Carlos Did
The team had automated backups with 14-day retention. They created a new instance restored to 2:47pm on Friday — the timestamp just before the deployment. They renamed the new instance to take over the application's connection string, kept the broken instance around for 24 hours to verify nothing was missed, then deleted it. Total downtime: 38 minutes.
A team without automated backups would have rebuilt from the most recent manual snapshot — which on most teams is days or weeks old, not minutes.
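The rename step in Carlos's recovery can be sketched as two `modify_db_instance` calls: park the broken instance under a new name, then give the restored instance the original name. This is a sketch under assumptions. The identifiers and the `-broken` suffix are hypothetical, and the code omits the waiting a real runbook needs, since the first rename must finish before the second can reuse the name.

```python
def build_rename_steps(live_id, restored_id, parked_suffix="-broken"):
    """Return modify_db_instance arguments, in order, for swapping two instances."""
    parked_id = f"{live_id}{parked_suffix}"
    if restored_id in (live_id, parked_id):
        raise ValueError("restored instance needs its own distinct identifier")
    return [
        # Step 1: move the broken instance out of the way, freeing the original name.
        {
            "DBInstanceIdentifier": live_id,
            "NewDBInstanceIdentifier": parked_id,
            "ApplyImmediately": True,
        },
        # Step 2: give the restored instance the name the application expects.
        {
            "DBInstanceIdentifier": restored_id,
            "NewDBInstanceIdentifier": live_id,
            "ApplyImmediately": True,
        },
    ]


def apply_step(step, region="us-east-1"):
    """Submit one rename. Needs boto3 and AWS credentials; not called in this sketch."""
    import boto3

    rds = boto3.client("rds", region_name=region)
    return rds.modify_db_instance(**step)


steps = build_rename_steps("prod-db", "prod-db-restored")
for step in steps:
    print(step["DBInstanceIdentifier"], "->", step["NewDBInstanceIdentifier"])
```

Keeping the broken instance under a parked name, rather than deleting it, matches the 24-hour verification period described above.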
For the Exam
Three patterns DVA-C02 tests:
- Need to restore to a specific second in the last 35 days? → Automated backups + PITR;
- Need a long-term snapshot that survives the instance? → Manual snapshot;
- Need high availability vs read scaling? → Multi-AZ vs Read replicas.
Mix these up on the exam and you will lose easy points.