Redshift Tuning Table Design (Baseline)
April 28, 2019
Today we are going to take a look at the Tuning Table Design tutorial in the Amazon Redshift Developer Guide, but we're going to test two things. One, we'll follow the tutorial to look at table design; two, we'll run it on two different clusters, one compute-optimized and one storage-optimized. The question we're trying to answer is: will we see performance improvements on compute nodes versus storage nodes?
We've built two clusters for this test. Each is a four-node cluster: one built with dc2.large nodes, the other with ds2.xlarge.
| Node Size | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity |
|---|---|---|---|---|---|---|---|
| ds2.xlarge | 4 | 13 | 31 | 2 | 2 TB HDD | 1–32 | 64 TB |
| dc2.large | 2 | 7 | 15.25 | 2 | 160 GB NVMe-SSD | 1–32 | 5.12 TB |
We are also going to encrypt both databases using the default master key, aws/redshift. We mention this because encryption might have some effect on the timings below.
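For reference, clusters like ours can be stood up from the AWS CLI. This is a sketch rather than the exact commands we ran; the cluster identifiers, user name, and password are placeholders:

```shell
# Sketch only: identifiers, user name, and password are placeholders.
# Four-node compute-optimized cluster, encrypted with the default aws/redshift key.
aws redshift create-cluster \
  --cluster-identifier redshift-test-compute \
  --node-type dc2.large \
  --number-of-nodes 4 \
  --master-username admin \
  --master-user-password 'ChangeMe1' \
  --encrypted

# Four-node storage-optimized cluster, same settings otherwise.
aws redshift create-cluster \
  --cluster-identifier redshift-test-storage \
  --node-type ds2.xlarge \
  --number-of-nodes 4 \
  --master-username admin \
  --master-user-password 'ChangeMe1' \
  --encrypted
```

Passing `--encrypted` without a KMS key ID uses the default aws/redshift master key.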
Also note that we are not going to go into the details of the tutorial itself; for example, there's no need here to discuss creating the tables or loading the data. We're just going to focus on the results.
NOTE: loading the lineorder table took a considerable amount of time.
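For context, the loads in this tutorial are plain COPY commands from the tutorial's sample-data bucket, along these lines (the IAM role ARN is a placeholder):

```sql
-- Sketch of the tutorial-style load; <account-id> and <role-name> are placeholders.
copy lineorder
from 's3://awssampledbuswest2/ssbgz/lineorder'
credentials 'aws_iam_role=arn:aws:iam::<account-id>:role/<role-name>'
gzip compupdate off
region 'us-west-2';
```

The `region` clause matters here because, as noted below, our loads crossed regions.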
Following are the create times for each object:
| object | create time on compute | create time on storage |
|---|---|---|
| redshift_test schema | 352 ms | 1 s 379 ms |
| redshift_test.part | 299 ms | 1 s 304 ms |
| redshift_test.supplier | 363 ms | 771 ms |
| redshift_test.customer | 416 ms | 2 s |
| redshift_test.dwdate | 368 ms | 1 s 654 ms |
| redshift_test.lineorder | 375 ms | 688 ms |
Following are the load times for each table:
| table name | rows | load time on compute | load time on storage |
|---|---|---|---|
| redshift_test.customer | 3,000,000 | 28 s 40 ms | 27 s 412 ms |
| redshift_test.dwdate | 2,556 | 14 s 546 ms | 14 s 85 ms |
| redshift_test.lineorder | 600,037,902 | 16 m 16 s 816 ms | 14 m 10 s 328 ms |
| redshift_test.part | 1,400,000 | 19 s 13 ms | 21 s 898 ms |
| redshift_test.supplier | 1,000,000 | 19 s 376 ms | 18 s 507 ms |
*These were cross-region loads: the data was in us-west-2 and we built the test clusters in us-east-1.
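For a sense of scale, the lineorder timings above work out to the following rows-per-second load rates. This is just arithmetic on the numbers in the table:

```python
# Rows-per-second load rates for lineorder, from the load-time table above.
rows = 600_037_902

compute_secs = 16 * 60 + 16.816   # 16 m 16 s 816 ms on the compute cluster
storage_secs = 14 * 60 + 10.328   # 14 m 10 s 328 ms on the storage cluster

compute_rate = rows / compute_secs
storage_rate = rows / storage_secs

print(f"compute: {compute_rate:,.0f} rows/s")   # roughly 614,000 rows/s
print(f"storage: {storage_rate:,.0f} rows/s")   # roughly 706,000 rows/s
```

So even with the cross-region penalty, both clusters sustained well over half a million rows per second on the big table.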
Next, storage use of each table. Following are how many 1 MB blocks of disk space are used for each table. The query itself took execution: 18 s 980 ms, fetching: 15 ms on the compute cluster and execution: 18 s 26 ms, fetching: 15 ms on the storage cluster.
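The per-table block counts come from a query against the STV_BLOCKLIST and STV_TBL_PERM system tables; the tutorial's version looks roughly like this:

```sql
-- Count 1 MB blocks per table across all slices.
select stv_tbl_perm.name as table, count(*) as mb
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
  and stv_blocklist.slice = stv_tbl_perm.slice
  and stv_tbl_perm.name in ('lineorder', 'part', 'customer', 'dwdate', 'supplier')
group by stv_tbl_perm.name
order by 1 asc;
```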
Interestingly, in step 2, AWS describes the cluster they used when writing this tutorial and notes that no two clusters are alike. Specifically: "Your results will be different, even if you use the same cluster configuration. System performance is influenced by many factors, and no two systems will perform exactly the same."
Because Redshift caches the results of certain types of queries in the memory of the leader node, we are going to disable this as suggested in the tutorial. Then we'll record our baseline, to compare against after we update the tables.
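Disabling the result cache is a per-session setting:

```sql
-- Turn off the leader-node result cache for this session so each run
-- actually executes the query instead of returning a cached result.
SET enable_result_cache_for_session TO off;
```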
The results below are from the second execution of each test query provided.

| query | compute cluster | storage cluster |
|---|---|---|
| query1 | execution: 6 s 979 ms, fetching: 7 ms | execution: 5 s 699 ms, fetching: 9 ms |
| query2 | execution: 5 s 592 ms, fetching: 67 ms | execution: 6 s 927 ms, fetching: 74 ms |
| query3 | execution: 6 s 350 ms, fetching: 13 ms | execution: 3 s 60 ms, fetching: 8 ms |
Now that we have a baseline, we’re off to step 3, selecting a sort key.