Redshift Tuning Table Design (Distribution Styles)
May 20, 2019
May 20, 2019
There are three distribution styles that can be defined in Amazon Redshift; Key, All and Even.
Because Redshift works in a distributed fashion where one leader node manages communication and task distribution amongst a set of worker nodes, the way you distribute your data across the back end cluster matters depending on how you query your data. For performance, the idea is to define distribution styles per table that can best distribute the load to each worker to maximize the compute power of each of them so that when they return data to the master, it is not bogged down with having to do a majority of the work. Most distributed systems work in this manner.
With Key distribution, the data is distributed amongst the cluster nodes in a way that puts items with similar keys on to the same node. This is called ‘collocating’.
With All distribution, a copy of the entire table is distributed across all backend nodes. So, if you have a cluster of 4 nodes, and a 5 meg table, you’ll have a 5 meg table on each node.
With Even distribution, rows are distributed amongst back end nodes in a round-robin fashion. Even distribution is the default distribution on a table. This is the ‘fallback’ distirbution style when it can’t really be determined if Key or All would suffice.
Because the Redshift query optimizer “redistributes the rows to the compute nodes as needed to perform any joins and aggregations” you want to make sure that data is where it needs to be to avoid unecessary delay because of shuffling. This is where the query optimizer has to move data around the backend compute nodes to get things in the right place to execute the query. In addition, you want to make sure that matching data or data needed for a query is on the same node slice.
To determine if your query is going to move a lot of data around before you execute it, you can run an ‘explain‘. To run an ‘explain’ you would just type the word ‘explain’ at the very beginning of your query.
explain select sum(lo_revenue), d_year, p_brand1 from redshift_test.lineorder, redshift_test.dwdate, redshift_test.part, redshift_test.supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_category = 'MFGR#12' and s_region = 'AMERICA' group by d_year, p_brand1 order by d_year, p_brand1;
Then, you are going to look for two things in the output:
- Items prefixed with DS_BCAST
2. Items prefixed with DS_DIST
DS_BCAST means that data will be broadcast to every slice.
The tutorial uses query2 as the example for the explain and you can see DS_BCAST shows twice. The first one:
-> XN Hash Join DS_BCAST_INNER (cost=30634.67..37198397649.55 rows=4612829 width=20)
and the second one:
-> XN Hash Join DS_BCAST_INNER (cost=17640.00..13453758507.64 rows=24001516 width=24)
Based on this information, we can now define our distribution styles for the tables:
|Table name||Sort Key||Distribution Style|
Now it’s time to look at compression encoding.
Advanced Data Engineering Platform for Cleansing, Preprocessing and Analytics