DynamoDB Scan Vs Query Operation Experiment Result

#aws #database #nosql #performance

I've been using AWS DynamoDB for the past couple of years (a NoSQL database). Amazon DynamoDB is a key-value and document database with a response time in single digit milliseconds. It is a fully managed, multi-region, multi-active, persistent database for internet-scale applications that has built-in security, backup and restoration, and in-memory caching.

DynamoDB supports two types of read operations: Query and Scan.
To find information, a query operation uses either the primary key or the index key. Scan, as the name implies, is a read call that scans the entire table for a specified result. DynamoDB is designed to be query-optimized.

DynamoDB also supports secondary indexes, which allow lookups based on keys other than the primary key. More complex queries on DynamoDB data are occasionally required. Instead of scanning for such queries, it is usually preferable to create a GSI (global secondary index).

Out of interest, I ran an experiment to confirm that Scan operation is indeed slower than Query operation.

The experiment was really simple and straightforward. I started by creating a new DynamoDB table in Asia Pacific (Sydney) Region and populating it with 1000 items that were less than 4KB in size each. UUID was used as the primary key for the table as shown below.

Then, using the AWS SDK Python (Boto3) script, I executed the query operation 450 times, each time querying a different item (random UUID primary key) and timing how long it took to get a response. The same approach was used to test the Scan operation by picking a UUID at random from a data list.

The outcome of the test is as follows.

The following are some of the experiment's findings:

The Scan operation is more time-consuming than the Query operation.
The Query operation took less than 65 milliseconds at the 95 percentile, compared to 650 milliseconds for Scan.
In comparison to a query that conducts a straight lookup based on the partition key, response times vary substantially depending on the item and how the scan process works.

In the actual world, a DynamoDB table could have over 1000 items.
So you can see how slow the Scan process is compared to the Query even with only a few items in the table; imagine how slow it will be when there are millions of items in the table to Scan. According to the findings of this experiment, the Query operation outperforms the Scan operation by a large margin.

If you need to Scan, you should consider GSI first. Scan should only be used as a last resort. If this is the case, you should consider if DynamoDB is the best fit for your use case. In addition, if cost is a consideration, Scan will consume more capacity units than Query. Finally, if you've decided that Scan is the way to go, explore the possibility of a parallel Scan instead of normal Scan.

The script for the experiment is available on github. I'd welcome your thoughts on the script, as well as any recommendations for how to improve the experiment. This isn't a full-fledged experiment with a variety of item sizes, network speeds, and other factors. This script can be used as is or modified to run your own tests.

Thanks for reading!

If you enjoyed this article feel free to share it on social media 🙂

Say Hello on: Linkedin | Twitter | Polywork

Blogging: Dev | Hashnode

Github: hseera