Daniel Reis

Posted on Aug 21

Database 101: How does migration between databases work in the real world?

#database #scylladb #programming #beginners

Recently my new task at ScyllaDB is to study how to Migrate data between Databases and as I promised before, I'll keep you folks about my latest studies.

If you’re just getting started with databases in general or databases, you might want to start off by reading my initial post, Database 101: Data Consistency for Beginners for Beginners. That article captures my own exploration of how many database paradigms exist as I look far beyond my previous experience with just SQL and MySQL. I’m keeping track of my studies in this Database 101 series.

1. Prologue
2. Database Migrations: How to Start
3. Why are we doing this migration?
4. Proof of Concept: where to start?
- 4.1 PoC: Same Paradigm
- 4.2 PoC: Different Paradigm
5. Testing it before it's too late! (seriously)
6. Migration Strategies
- 6.1 Strategy: Cold Migration
- 6.2 Strategy: Hot Migration
7. Final Considerations

1. Prologue

Since I started my journey in this "database environment", I have asked myself so many questions that probably will take a few years to be answered. But one of those questions is how to properly perform a database migration, and I still don't have the answer.

Actually, when I migrated the He4rt Developers (community I lead) bot, I did my best to make everything on the backend and the database model as perfect as possible. But when I had to migrate everything, it was a real mess. I decided to use the "Big Bang" (pull request with 532 files changed) migration approach, and it was one of my bad decisions as an open source maintainer.

At that time, I didn't have any knowledge about this topic (database migrations) at all. But I learned a lesson that I'm going to teach you in this article.

2. Database Migrations: How to Start

If you ever considered receive such a task, of investigate and research how properly migrate data from a database to another, you should know a few things before do anything related to code. What do I mean by that? Just throw queries between databases is not where you're going to start the investigation.

Let's just ask ourselves a few questions before everything:

What is the motivation for database migration?
What migration approach should we use on the project?
Are we going to jump from one paradigm to another?
Does the new database have all the support we need for the things we're currently using?

With these four questions in mind, we can start researching until answer all of that and then put hands on queries/code.

3. Why are we doing this migration?

When we talk about replacing any part of the stack of a running project, the reason needs to be clear to the entire team that is planning this huge movement inside the product.

So what are you looking for in the replacement? A few possibilities:

Lower latency for I/O operations;
Cheaper database to maintain;
Scalable database to keep up with the product.

Just a reminder that these are just a few items on the giant list of motivations for why to replace a database, ok? Ask your team/leaders all the details you need to understand the motivation and make sure this is what you're looking for!

4. Proof of Concept: where to start?

Good! We have a clear idea of why we are doing this. Now let's figure out the things that will be important to make this work. Do you know the difference between each database paradigm? Well, if you're migrating data between two different databases, it should be a constant concern. Let's understand which type of PoC you will be running.

4.1 PoC: Same Paradigm

Imagine a few scenarios where you want to migrate from:

CassandraDB to ScyllaDB (Wide Column);
MySQL to PostgreSQL (Relational);
Memcache to Redis (Key-value);
MongoDB to DynamoDB or Firebase (document).

All of these scenarios are kinda easier to migrate, because they share the same Database Paradigm, which means that they follows the same architecture.

So, they have mostly the same features, indexing and data types. Maybe will have slight differences but are things that you can handle by doing tests and migrating from a feature/data type to another.

Even when Discord was migrating from CassandraDB to ScyllaDB had a few things to fix, but at the end of the day it's simpler than between different paradigms. To be super honest, if you have a problem during the PoC, remember that it's part of the job. ¯\(ツ)/¯

4.2 PoC: Different Paradigm

Ok, and if our problem is to go from some document based database to a wide column? Like going from a MongoDB to a ScyllaDB?

Sounds like a tough task, and probably will be, and this part especially needs to be carefully designed.

When Discord started migrating from Document Based Database to the Wide Column Database in 2017, they had the same issue.

How do you split an infinite JSON object? Can you imagine? Check the JSON example below:

{
  "servers": [
    {
      "218378123781": {
        "id": "218378123781",
        "server_name": "He4rt Developers",
        "channels": [
          {
            "78931278921723": {
              "name": "Test Channel 1",
              "messages": [
                {
                  "312783712867": {
                    "message_id": "312783712867",
                    "chatter_id": "danielhe4rt",
                    "content": "oh hi lol",
                    "created_at": "1691972222",
                    "updated_at": "1691972222"
                  }
                },
                {
                  "312783712867": {
                    "message_id": "312783712867",
                    "chatter_id": "danielhe4rt",
                    "content": "oh hi lol",
                    "created_at": "1691972222",
                    "updated_at": "1691972222"
                  }
                }
              ]
            }
          }
        ]
      }
    }
  ],
  "users": [
    {
      "danielhe4rt": {
        "chatter_id": "danielhe4rt",
        "joined_at": "1691972222"
      }
    },
    {
      "dont_forget_to_follow_me_on_socials": {
        "chatter_id": "dont_forget_to_follow_me_on_socials",
        "joined_at": "1691972222"
      }
    }
  ]
}

This is how a briefly imagined a social modeling like Discord. Yeah, I know. That's far from the reality but let's stick into the problem here, ok? And in fact, if you have a good DBA and architects on your side, it will not be a problem.

First, you need the whole team to know more about the paradigm and how to use it. With document-oriented, you can push whatever you want because there's no strong data consistency at all. But if you move to ScyllaDB, you'll need to model tables, focus on what query you want to run, and also understand how things like "consistency level" and "replication factor" work.

At the end we should have something like:

CREATE TABLE servers (
    server_id bigint
    owner_id text
    server_name text,
    created_at timestamp,
    PRIMARY KEY (server_id, created_at)
);

CREATE TABLE channels (
    server_id bigint
    channel_id bigint
    channel_name text,
    created_at timestamp
    PRIMARY KEY ((server_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

CREATE TABLE channel_messages (
    message_id bigint
    channel_id bigint,
    chatter_id text,
    content text,
    created_at timestamp,
    PRIMARY KEY ((message_id, channel_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

CREATE TABLE users (
    user_id text,
    user_name text,
    created_at timestamp,
    PRIMARY KEY (user_id, created_at)
);

Yeah, CQL (Cassandra Query Language) is very similar to SQL (Structured Query Language). I invite you to have a try doing this ScyllaDB: NoSQL 101 Essentials course. You will learn a bunch of cool things about the Wide Column paradigm :D

5. Testing it before it's too late! (seriously)

Ok, we ran our PoC and we know which database and paradigm will be the best for our new environment. Now your team is focused to finish the data modeling ASAP, right? Once that is done, the focus will be on getting all the scripts ready to make this a successful migration.

The best thing that you can do is to split your tests with 1% and 5% of your database in a staging environment. Why? If your "new system" or "new version" of the system crashes with 1% of the database, it will reflect that there's something missing in the data modeling or even at the backend. So, stop the migration scripts and write some tests for your application.

Now that you're done with the 1% migration bugs, let's do the same process with 5% of your database. In this step you should test everything carefully, because it will be the last testing step you will do before choosing the migration strategy and executing it.

6. Migration Strategies

Of course, migrating between databases isn't a "new" thing, so there are a few things planned and ready to make your life easier. Also, a few rules that together will make your database migration look like it was done by a pro.

In this regard, there are two very popular strategies called "Hot Migration" and "Cold Migration". The names of the strategies themselves say something that we've probably been asking ourselves since the beginning of this research: Should I shut everything down or migrate while the old database is still running? Well, that depends on your needs. The process itself is called ETL (Extract, Transform and Load) in any case btw.

6.1 Strategy: Cold Migration

Let's start with the "easiest" and safer migration strategy: Cold Migration. The reason to use this type of migration is understand if you can stop your system for the time needed without any problem. Basically, know if there's a maintenance period and let users know that. It also may known as "Big Bang Migration", since it will be migrated entirely to a new environment that had nothing before.

This approach is used on small/medium systems and is usually done at dawn, since it's probably the time of day when fewer users are requesting the system.

There's a checklist of things you need to be aware of when doing a cold migration. Let's check it out:

1. Planning: what time of day and who will be responsible for this task?
2. Extraction: how are we going to dump all this data? Maybe there's specific tooling already done on the target database to be used.
3. Transformation: is the data in the right form for the new database? Do both databases have the same structure? Is there any tooling to transform this data for us?
4. Load: how do we make sure that our new database gets all of our data? Any broken query? All good?
5. Validation: ok, we loaded everything. Now we need to validate if the data matches with the legacy system. Good Luck!
6. Testing: let's connect it in the migrated database into the system and run all the test suites. Almost there!!!
7. Activation: all good. Now it's time to shut down the legacy database. Congratz, buddy!

Basically, this is the migration flow when you have the possibility to shut everything down and now you know how to properly do it.

6.2 Strategy: Hot Migration

Ok... If you want to migrate things while your system is still running... Now you're in trouble! Wait, I'm just kidding :p

This is where things get a bit complicated because you have to write to two databases at the same time. So you can imagine that instead of having 1 instance of the database, you will need two, and depending on that

We also have a checklist to help us with this. So let's check the flow:

1. Planning: time of the day is not the main issue here, but a proper infrastructure to support the transition between both databases;
2. Replication: both systems need to be feed at the same time while the migration is happening
3. Tests and Validation: during the replication, you should validate all the data that you're inserting along with the migrated one.
4. Minimal Interruption: the system should be online all the time, and if any problem occurs, it should not be perceptible by the user.
5. Switch: When all the important data is finally migrated, is time to point the app to the new database;
6. Activation: all good. Now it's time to shut down the legacy database. Congratz, buddy!

This migration strategy is indicated for companies that require high availability and have critical services that cannot be stopped.

7. Final Considerations

Well, my task was to understand more about DynamoDB and create some content about the ScyllaDB Alternator, a migration tool that converts any DynamoDB I/O into a ScyllaDB. I'm still working in a PoC to understand more about it, but I certainly learned some really cool things along the way to know more tips, tricks, and last names of things related to migrations.

Also there will be an "NoSQL Database Migration Masterclass" for free event tomorrow (22/08)! I'll be there on the chat learning more about the topic and invite you to be there with me!

I hope that this tutorial could help you to understand briefly this topic! And please, let me know in the comments which things that I should learn about.

Don't forget to like and share this article with your friends and go fill your water bottle! Let's keep in touch:

Twitter DanielHe4rt PT-BR
Twitter DanielHe4rt EN
Twitch Channel