AWS Aurora Global Database Recovery

Manish Sharma
6 min read · Apr 19, 2023

By using an Aurora global database, you can plan for and recover from a disaster fairly quickly.

Recovery from disaster is typically measured using values for RTO and RPO.

  • Recovery time objective (RTO) — The time it takes a system to return to a working state after a disaster. In other words, RTO measures downtime. For an Aurora global database, RTO can be in the order of minutes.
  • Recovery point objective (RPO) — The amount of data that can be lost (measured in time). For an Aurora global database, RPO is typically measured in seconds.
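
To make the two measures concrete, here is a small worked example. The timestamps are invented for illustration, and the function name `measure_recovery` is hypothetical; it simply computes the observed downtime and data-loss window for one incident.

```python
from datetime import datetime, timedelta


def measure_recovery(last_replicated_write: datetime,
                     outage_start: datetime,
                     service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return the (RTO, RPO) observed for a single incident."""
    rto = service_restored - outage_start        # how long the system was down
    rpo = outage_start - last_replicated_write   # window of writes that may be lost
    return rto, rpo


# Illustrative values only: replication was 2 seconds behind when the outage
# began, and the database was writable again 7 minutes later.
outage = datetime(2023, 4, 19, 10, 0, 0)
rto, rpo = measure_recovery(
    last_replicated_write=outage - timedelta(seconds=2),
    outage_start=outage,
    service_restored=outage + timedelta(minutes=7),
)
print(f"RTO: {rto}, RPO: {rpo}")  # RTO: 0:07:00, RPO: 0:00:02
```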

With an Aurora global database, there are two different approaches to failover depending on the scenario.

  1. Manual unplanned failover (“detach and promote”)

  • To recover from an unplanned outage or to do disaster recovery (DR) testing, perform a cross-Region failover to one of the secondaries in your Aurora global database.
  • The RTO for this manual process depends on how quickly you can perform all the manual tasks described below.
  • The RPO is typically measured in seconds, but this depends on the Aurora storage replication lag across the network at the time of the failure.

  2. Managed planned failover

  • A managed planned failover is possible only if the primary and secondary DB clusters run the same major and minor engine versions. The patch levels generally must match too, although some minor engine versions allow them to differ.
  • This feature is intended for controlled environments, such as operational maintenance and other planned operational procedures.
  • By using managed planned failover, you can relocate the primary DB cluster of your Aurora global database to one of the secondary Regions.
  • This feature synchronizes the secondary DB clusters with the primary before making any other changes, so the RPO is 0 (no data loss).
Figure: Initial State
Figure: Initiate Manual Failover to secondary region
Figure: Final State after manual planned failover
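
The engine version requirement for managed planned failover can be checked before you start. The sketch below is a deliberately strict check, assuming you have already fetched each cluster's description (for example from the `DBClusters` list that the RDS `describe_db_clusters` call returns, which includes `Engine` and `EngineVersion`). The function name and the sample version strings are illustrative, and the check does not model the cases where patch levels are allowed to differ.

```python
def same_engine_version(clusters: list[dict]) -> bool:
    """True when every cluster reports the same Engine and EngineVersion.

    Conservative on purpose: it treats any difference, including a
    patch-level one, as a mismatch.
    """
    versions = {(c["Engine"], c["EngineVersion"]) for c in clusters}
    return len(versions) == 1


# Illustrative cluster descriptions (values are made up).
primary = {"Engine": "aurora-mysql", "EngineVersion": "8.0.mysql_aurora.3.02.0"}
secondary = {"Engine": "aurora-mysql", "EngineVersion": "8.0.mysql_aurora.3.02.0"}
outdated = {"Engine": "aurora-mysql", "EngineVersion": "8.0.mysql_aurora.3.01.0"}

print(same_engine_version([primary, secondary]))
print(same_engine_version([primary, outdated]))
```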

Recovering from an Unplanned Outage by Promoting a Secondary to Primary

An Aurora global database can suffer an outage in its primary AWS Region for any number of reasons. When that happens, the writer instance in the primary Region is no longer accessible, and replication from the primary Region to the secondary Regions stops.

Fig: Aurora Global Database — Primary region is down

As part of business continuity and disaster recovery (BCDR) planning, you need to fail over to a secondary site quickly to keep the database recovery time objective (RTO) low.

Follow the steps below to fail over to a secondary cluster in an Aurora global database:

  • Even though the primary site is down, stop sending any DDL/DML commands to the primary DB cluster. If needed, stop all applications and processes that connect to the Aurora database. If the application is deployed on Kubernetes, set its replica count to 0.
Fig: Stop sending traffic to primary site writer instance
  • Depending on business needs, an Aurora global database can have read replicas in multiple secondary Regions. In that case, identify the secondary DB cluster with the least replication lag.
  • Detach the identified secondary DB cluster from the Aurora global database. This makes it a standalone Aurora DB cluster, because it is no longer part of the global database.
Fig: Detaching Secondary DB from Aurora Global Database
  • Any other secondary Aurora DB clusters associated with the primary cluster in the Region with the outage are still available and can accept calls from the application. They also consume resources. Because you are recreating the Aurora global database, remove these other secondary DB clusters before creating the new one. Doing this avoids data inconsistencies among the DB clusters in the Aurora global database (split-brain issues).
  • Now treat this standalone cluster as the new primary: once detached, it has full read and write capability. The endpoint your application uses changes from the reader endpoint to the cluster (writer) endpoint. For example, my-global.cluster-ro-aabb.us-west-2.rds.amazonaws.com becomes my-global.cluster-aabb.us-west-2.rds.amazonaws.com
Fig: Promoting standalone secondary read-replica to primary
  • Reconfigure your application to use the new endpoint of the DB cluster. Then start all the stopped applications and processes so that they send their write operations to this now standalone Aurora DB cluster.
Fig: Sending Applications Request to Secondary DB Cluster
  • (Optional) Add AWS Regions to the DB cluster as needed to recreate the topology that your application requires. This step starts replication from the new primary to each added secondary Region.
Fig: Add Secondary Read Replica Region
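
The "pick the least-lagged secondary, then detach it" part of the steps above can be sketched in Python. This is a minimal sketch, not a complete runbook: it assumes `cloudwatch` and `rds` are boto3 clients created for the secondary's Region (for example `boto3.client("rds", region_name="us-west-2")`), and the function names are hypothetical. The calls used are CloudWatch `get_metric_statistics` and RDS `remove_from_global_cluster`.

```python
from datetime import datetime, timedelta, timezone


def latest_lag_ms(cloudwatch, cluster_id, now=None):
    """Most recent AuroraGlobalDBReplicationLag average (ms), or None if no data."""
    now = now or datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="AuroraGlobalDBReplicationLag",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else None


def pick_least_lagged(lag_ms_by_arn):
    """Return the cluster ARN with the smallest known lag; skip clusters without data."""
    known = {arn: lag for arn, lag in lag_ms_by_arn.items() if lag is not None}
    if not known:
        raise ValueError("no secondary has replication lag data")
    return min(known, key=known.get)


def detach_secondary(rds, global_cluster_id, secondary_arn):
    """Detach the chosen secondary so it becomes a standalone read/write cluster."""
    return rds.remove_from_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        DbClusterIdentifier=secondary_arn,  # this parameter expects the cluster ARN
    )
```

A caller would collect `latest_lag_ms` for every secondary, pass the resulting mapping to `pick_least_lagged`, and hand the winner to `detach_secondary`. Stopping applications and switching endpoints remain manual steps outside this sketch.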

After the outage is resolved, you can make the original AWS Region the primary again. To do so, add the old AWS Region (for example, us-west-1) to your new global database, and then use the managed planned failover process to switch its role.
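
Recreating the global database and adding a Region to it can also be scripted. The following is a rough sketch under several assumptions: `rds_primary` and `rds_secondary` are boto3 RDS clients for the new primary Region and the Region being added, the function name and the `-instance-1` naming are hypothetical, and real use would need to wait for each resource to become available between calls (and supply extra settings such as a KMS key for encrypted clusters). The calls themselves are `create_global_cluster`, `create_db_cluster`, and `create_db_instance`.

```python
def readd_region(rds_primary, rds_secondary, *, global_id, primary_arn,
                 secondary_cluster_id, engine, engine_version, instance_class):
    """Wrap the standalone cluster in a new global database, then add a Region."""
    # 1. Create a new global database with the promoted cluster as its primary.
    rds_primary.create_global_cluster(
        GlobalClusterIdentifier=global_id,
        SourceDBClusterIdentifier=primary_arn,
    )
    # 2. Create a secondary cluster in the other Region and attach it.
    #    (In practice, wait until the global database is available first.)
    rds_secondary.create_db_cluster(
        DBClusterIdentifier=secondary_cluster_id,
        Engine=engine,
        EngineVersion=engine_version,
        GlobalClusterIdentifier=global_id,
    )
    # 3. Add one DB instance so the secondary cluster can serve reads.
    rds_secondary.create_db_instance(
        DBInstanceIdentifier=f"{secondary_cluster_id}-instance-1",
        DBClusterIdentifier=secondary_cluster_id,
        DBInstanceClass=instance_class,
        Engine=engine,
    )
```

Using the same `engine_version` as the primary keeps the clusters eligible for the managed planned failover described next.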

Performing a Managed Planned Failover to Fail Back to the Original Primary Site

  • Sign in to the AWS Management Console and open the Amazon RDS console at https://console.aws.amazon.com/rds/
  • Choose Databases and find the Aurora global database to which you want to add the original primary AWS Region (us-west-1).
  • Add the original primary AWS Region (us-west-1) to the Aurora global database. This step begins replication from the primary (us-west-2) to the secondary Region (us-west-1).
  • Check the lag times for the secondary Aurora DB clusters in the Aurora global database. Use Amazon CloudWatch to view the AuroraGlobalDBReplicationLag metric for each secondary. This metric tells you how far behind (in milliseconds) a secondary is relative to the primary DB cluster. Its value is directly proportional to the time Aurora takes to complete the failover, so wait until it is as close to 0 as possible. Because managed planned failover synchronizes the secondary with the primary before switching roles, a higher lag lengthens the failover rather than causing data loss.
  • Take applications offline to prevent writes from being sent to the primary cluster of the Aurora global database.
  • Choose Fail over global database in the Actions menu. The failover process doesn’t begin until after you choose the failover target in the next step.
Fig: Failover Global Database
  • Choose the secondary Aurora DB cluster that you want to promote to primary. The secondary DB cluster must be available. If you have more than one secondary DB cluster, you can compare the lag amount for all secondaries and choose the one with the smallest amount of lag.
  • Choose Fail over global database to confirm your choice of secondary DB cluster and begin the failover process.
  • The failover process can take some time to complete, and your database is unavailable for a short time while the primary and the selected secondary cluster assume their new roles. For this reason, perform this operation during nonpeak hours or at another time when writes to the primary DB cluster are minimal.
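
The console steps above have an API equivalent. The sketch below assumes `rds` is a boto3 RDS client and uses its `failover_global_cluster` call (newer AWS documentation tends to describe this planned operation as a switchover). The lag guard and its `max_lag_ms` default are my own illustrative additions, not AWS requirements, and the function name is hypothetical.

```python
def start_planned_failover(rds, global_cluster_id, target_arn, lag_ms,
                           max_lag_ms=1000.0):
    """Start a managed planned failover once replication lag is acceptably low.

    lag_ms is the latest AuroraGlobalDBReplicationLag reading for the target;
    max_lag_ms is an arbitrary, hypothetical threshold chosen for this sketch.
    """
    if lag_ms is None or lag_ms > max_lag_ms:
        raise RuntimeError(
            f"replication lag {lag_ms} ms exceeds {max_lag_ms} ms; not failing over"
        )
    return rds.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_arn,  # ARN of the secondary to promote
    )
```

Refusing to start when the lag is high only keeps the unavailability window short; it does not replace taking the applications offline first.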

The Role column of the Databases list shows the state of each Aurora DB instance and Aurora DB cluster during the failover process.

  • When the failover completes, the Databases list shows the Aurora DB clusters and their current roles.
  • When the failover process completes, the promoted Aurora DB cluster can handle write operations for the Aurora global database. Make sure to change the endpoint for your application to use the new endpoint.
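
Looking up the new writer and its endpoint after the failover can be sketched as follows. It assumes boto3 RDS clients: `describe_global_clusters` reports each member's `IsWriter` flag, and `describe_db_clusters` (called with a client for the writer's Region) returns the cluster's `Endpoint`. The helper names and the sample ARN are illustrative.

```python
def region_of(cluster_arn):
    """Extract the Region from an ARN such as arn:aws:rds:us-west-1:123456789012:cluster:name."""
    return cluster_arn.split(":")[3]


def current_writer_arn(rds, global_cluster_id):
    """Return the ARN of the member currently marked as writer, or None."""
    resp = rds.describe_global_clusters(GlobalClusterIdentifier=global_cluster_id)
    members = resp["GlobalClusters"][0]["GlobalClusterMembers"]
    writers = [m["DBClusterArn"] for m in members if m["IsWriter"]]
    return writers[0] if writers else None


def writer_endpoint(rds_in_writer_region, cluster_arn):
    """Return the cluster (writer) endpoint that applications should now use."""
    resp = rds_in_writer_region.describe_db_clusters(DBClusterIdentifier=cluster_arn)
    return resp["DBClusters"][0]["Endpoint"]


print(region_of("arn:aws:rds:us-west-1:123456789012:cluster:my-global-west1"))  # us-west-1
```

`region_of` tells you which Region to create the second client for before calling `writer_endpoint`.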

Manish Sharma

I am a technology geek and keep pushing myself to learn new skills. I am certified as an AWS Solution Architect (Associate and Professional) and as a Terraform Associate Developer.