Shard Split using replica on AWS RDS

Debraj Maity
5 min read · Feb 27, 2022
Database Shard Split

Shard splitting is a technique for splitting a single master database into multiple master databases. Recently at my workplace we were trying to find a simple solution for splitting our database, with minimal interruption, since the databases run in production. After much deliberation and research, we found a replica-based shard split process to be the best solution.

We use AWS RDS for storing transactional data, and a single database holds several terabytes of data. Managing a big database instance in production is always a challenge from an operations point of view, as there are several issues DBAs/engineers can run into. A few important ones are:

  1. Adding new indexes to big tables takes time, especially when reads/writes are heavy.
  2. DB backups and snapshots become more time consuming as the database grows.
  3. The larger the database, the more customers are likely hosted on the single instance, so in case of a DB crash the blast radius is huge.

To mitigate the above challenges, we decided to split the database rather than keep increasing the provisioned disk size of the single DB node.

While there are several approaches to creating a separate master instance and splitting the data, we were looking for a simple and safe solution for a database running in production. I came across this article from AWS and found it useful for our use case. Based on it, we were able to split the data on a running system with minimal downtime.

Following is a summary of the steps to successfully shard a database.

Pre-requisite steps (these are only needed if you have never created sharding logic for your database and used it in your read/write services)

Choose a shard key. This can be a tenant id, customer id, or user id, depending on the boundaries along which you want to split the data. Usually it depends on the level of isolation needed on the application side.

Create a shard mapping table with a shard key column and an integer shard id column.

Assigning a shard key to a shard id is usually done using a hash function that takes a shard key and maps it to an available shard. This is generally integrated with the application logic that handles addition/deletion of tenant ids (customers/users).

Assign a shard id to the existing database instance and create entries in the shard mapping table mapping each shard key to that shard id.

(Figure: database_mapping_init)
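The pre-requisite setup above can be sketched as follows, using SQLite as a stand-in for the production database; the table layout, hash function, and customer ids are illustrative assumptions, not the exact production schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database

# Shard mapping table: shard key (e.g. customer id) -> integer shard id.
conn.execute(
    "CREATE TABLE shard_mapping (shard_key INTEGER PRIMARY KEY, shard_id INTEGER NOT NULL)"
)

NUM_SHARDS = 1  # before the split there is a single shard


def assign_shard(shard_key: int) -> int:
    # Simple modulo hash; any stable hash over the shard key works.
    return (shard_key % NUM_SHARDS) + 1


# Bootstrap: every existing customer maps to shard 1 (the current master).
for customer_id in [1, 2, 3, 4]:
    conn.execute(
        "INSERT INTO shard_mapping VALUES (?, ?)",
        (customer_id, assign_shard(customer_id)),
    )

print(conn.execute("SELECT shard_key, shard_id FROM shard_mapping").fetchall())
# [(1, 1), (2, 1), (3, 1), (4, 1)]
```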

We also need another table that maps shard id to database instance. This can live in a central configuration system from which the read and write services fetch it, or in a database table.

(Figure: shard_key_mapping)

Make the necessary changes to your application code so that it reads these mapping tables to figure out which database to connect to for read and write queries.
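Assuming the two mapping tables described above, the routing logic might look roughly like this (table names and the endpoint value are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the config/mapping store
conn.execute("CREATE TABLE shard_mapping (shard_key INTEGER PRIMARY KEY, shard_id INTEGER)")
conn.execute("CREATE TABLE shard_db_mapping (shard_id INTEGER PRIMARY KEY, db_endpoint TEXT)")
conn.executemany("INSERT INTO shard_mapping VALUES (?, ?)", [(1, 1), (2, 1), (3, 1), (4, 1)])
conn.execute("INSERT INTO shard_db_mapping VALUES (1, 'shard1.example.rds.amazonaws.com')")


def endpoint_for(shard_key: int) -> str:
    # Two lookups: shard key -> shard id -> database endpoint.
    row = conn.execute(
        "SELECT m.db_endpoint FROM shard_mapping k "
        "JOIN shard_db_mapping m ON m.shard_id = k.shard_id "
        "WHERE k.shard_key = ?",
        (shard_key,),
    ).fetchone()
    return row[0]


print(endpoint_for(3))  # shard1.example.rds.amazonaws.com
```

In production the lookup would typically be cached in the service rather than queried per request.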

Migration steps to perform shard split:

  1. Create a new read replica of the running master instance that needs to be split. Here is a link on how to create a read replica on RDS.
  2. Wait till the replica has caught up (this can take several hours or days depending on the size of the master instance, so monitor the ReplicaLag metric).
  3. Once the replica has caught up, we can proceed with the actual migration.
  4. Pick a suitable time window that minimally disrupts production services. Once decided, continue with the migration.
  5. Stop all write services. Wait for the replica to catch up, which is usually nearly instant once no new writes are happening. (At this point ReplicaLag will be zero.)
  6. Since there are no new writes, reads are not stale at this point, so read services can continue to run. This means read API services keep serving and there is essentially no noticeable interruption.
  7. Promote the read replica to a separate master instance. Here is a link on how to promote a read replica to a standalone master on AWS RDS.
  8. Once the new master instance is ready, it contains all the customer data we need (along with data we don't need, which we will clean up post migration), since it is basically a copy of the original database.
  9. To use the new master instance for the identified set of customers, update the shard_mapping table to point them to the new shard id.
  10. Read services might need to be restarted to pick up the new shard mapping, depending on how the application connects to the database.
  11. At this point we can start the write services, which will write to the new master shard for the set of customers that have been moved.
(Figure: database_shard_mapping_updated)
(Figure: shard_mapping_updated)
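Steps 9–11 boil down to a couple of mapping updates. A sketch, again with SQLite as a stand-in and illustrative endpoints, mirroring the mapping updates described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the mapping store
conn.execute("CREATE TABLE shard_mapping (shard_key INTEGER PRIMARY KEY, shard_id INTEGER)")
conn.execute("CREATE TABLE shard_db_mapping (shard_id INTEGER PRIMARY KEY, db_endpoint TEXT)")
conn.executemany("INSERT INTO shard_mapping VALUES (?, ?)", [(1, 1), (2, 1), (3, 1), (4, 1)])
conn.execute("INSERT INTO shard_db_mapping VALUES (1, 'shard1.example.rds.amazonaws.com')")

MOVED_CUSTOMERS = [3, 4]  # customers chosen to live on the new shard

# Register the promoted replica as shard 2...
conn.execute("INSERT INTO shard_db_mapping VALUES (2, 'shard2.example.rds.amazonaws.com')")
# ...and repoint the moved customers at it.
conn.executemany(
    "UPDATE shard_mapping SET shard_id = 2 WHERE shard_key = ?",
    [(c,) for c in MOVED_CUSTOMERS],
)

print(conn.execute("SELECT shard_key, shard_id FROM shard_mapping ORDER BY shard_key").fetchall())
# [(1, 1), (2, 1), (3, 2), (4, 2)]
```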

Post migration steps

Once the shard split is done, there is some post-migration cleanup to do, since both the old and the new database shard now hold customer data that is no longer needed there.

For example, in the steps above, customer id 3 & 4 data can be removed from the shard1 database, as these ids have been moved to the new shard2; similarly, customer id 1 & 2 data can be cleaned up from the newly created shard database, as the original database already holds the data for those customers.

Usually we can write a script that removes data from all the tables based on the sharding key and schedule it as a nightly job. The cleanup can take several hours to several days depending on the size of the tables.
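Such a cleanup script could look roughly like this, shown here for the old shard (the table name, batch size, and ids are illustrative assumptions); deleting in small batches keeps each statement short so the nightly job doesn't hold long locks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the old shard's database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
# Seed 5 rows per customer, 20 rows total.
conn.executemany(
    "INSERT INTO orders (customer_id) VALUES (?)",
    [(c,) for c in [1, 2, 3, 4] for _ in range(5)],
)

SHARDED_TABLES = ["orders"]  # every table keyed by the shard key
MOVED_CUSTOMERS = [3, 4]     # ids now served by the new shard
BATCH_SIZE = 2               # tiny here for illustration; thousands in practice


def cleanup_old_shard(conn, customer_ids, batch_size):
    # Delete rows belonging to moved customers, batch by batch, per table.
    placeholders = ",".join("?" * len(customer_ids))
    for table in SHARDED_TABLES:
        while True:
            cur = conn.execute(
                f"DELETE FROM {table} WHERE id IN ("
                f"SELECT id FROM {table} WHERE customer_id IN ({placeholders}) LIMIT ?)",
                (*customer_ids, batch_size),
            )
            conn.commit()
            if cur.rowcount == 0:
                break


cleanup_old_shard(conn, MOVED_CUSTOMERS, BATCH_SIZE)
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 10 rows remain
```

The same script, run against the new shard with the complementary set of customer ids, handles the other half of the cleanup.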

There it is: a very simple approach to splitting a single database into multiple databases without any major interruption to application traffic.

What do you think? Please feel free to drop a comment. Thank you for reading.

