
Selecting sortkeys and distkeys for an AWS Redshift table with WHERE and GROUP BY

Given queries like this:

select
  user_id, aggregate metrics
from
  table
where
  date < end_time and date >= start_time
group by
  user_id

What should my sort key and dist key be?

Based on articles that I have read online, a sort key of date makes most sense since we need to filter out irrelevant data. But I'm not sure how/if I can optimize the grouping on user_id by adding it to the sortkey or distkey.

A potential problem with using user_id as the dist key is that, because that column is severely skewed, some nodes could take much longer than others and increase the overall query time.

Your sort key reasoning sounds correct. Be aware that "start_time" and "end_time" in your query need to be literal date or timestamp values for the query optimizer to use the table metadata for initial filtering. The table also needs to be analyzed so that the metadata is valid.

As for the distribution key, you can look for other columns that make better dist keys in terms of table skew but still correlate well with user_id, so that they still provide a performance benefit. If none exists, you can make one - I've done this for clients a few times when it was important enough. An example of how this could play out:

  1. Create a new column in your table, let's call it __user_id_percentile (I like starting "artificial" columns with a double underscore to keep them distinguished from true data columns)
  2. Populate this column with values 1-100 (or 1-10,000 if you have a large cluster) such that each user_id value corresponds with only one __user_id_percentile value AND that the number of rows per __user_id_percentile is approximately equal
  3. Make __user_id_percentile the dist key of the table - this will lead to balanced distribution of the table and each value of user_id only existing on a single slice
  4. Add __user_id_percentile to your group by list - "group by __user_id_percentile, user_id". You don't need to include this new column in the select list and it won't affect the output of the query as long as no user_id exists in 2 or more __user_id_percentiles
  5. You will likely want to keep the user_id to __user_id_percentile mapping in a table so that your ETL processes can populate this value quickly. Also you may need to update this mapping if new data starts to skew the distribution but this is a fairly simple update (and vacuum) process that only needs to run rarely

That's it. Define a column that keeps each user_id value on a single slice and also keeps the table uniformly distributed.
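
To make step 2 more concrete, here is a minimal sketch of one way to build the user_id to __user_id_percentile mapping. It is not part of the original answer's process: it assumes you have already pulled per-user row counts out of the table (for example with a count grouped by user_id), and the function names `assign_buckets` and `bucket_loads` are hypothetical. It uses a simple greedy heuristic (largest users first, each placed in the currently lightest bucket), which is one reasonable choice among several.

```python
import heapq
from collections import Counter


def assign_buckets(rows_per_user, n_buckets=100):
    """Map each user_id to exactly one bucket in 1..n_buckets.

    rows_per_user: dict of user_id -> number of rows for that user
    (assumed input, e.g. exported from a grouped count on the table).
    """
    # Min-heap of (rows assigned so far, bucket number).
    heap = [(0, bucket) for bucket in range(1, n_buckets + 1)]
    heapq.heapify(heap)

    mapping = {}
    # Place the largest users first; ties are broken by user_id
    # so the result is deterministic.
    ordered = sorted(rows_per_user.items(), key=lambda kv: (-kv[1], kv[0]))
    for user_id, rows in ordered:
        load, bucket = heapq.heappop(heap)  # currently lightest bucket
        mapping[user_id] = bucket
        heapq.heappush(heap, (load + rows, bucket))
    return mapping


def bucket_loads(rows_per_user, mapping):
    """Total rows per bucket, to check how even the distribution is."""
    loads = Counter()
    for user_id, rows in rows_per_user.items():
        loads[mapping[user_id]] += rows
    return dict(loads)


if __name__ == "__main__":
    counts = {"a": 500, "b": 300, "c": 200, "d": 100, "e": 100}
    mapping = assign_buckets(counts, n_buckets=2)
    print(mapping)
    print(bucket_loads(counts, mapping))
```

Because every user_id lands in exactly one bucket, the "group by __user_id_percentile, user_id" condition from step 4 holds. Note that if a single user has more rows than an even share (total rows divided by the number of buckets), no mapping of this kind can fully balance the table; the bucket containing that user will always be heavier.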


 