Given queries like this:

    select user_id, aggregate metrics
    from table
    where date < end_time and date >= start_time
    group by user_id

What should be my sortkey and dist key?
Based on articles that I have read online, a sort key of date makes the most sense, since we need to filter out irrelevant data. But I'm not sure how, or whether, I can optimize the grouping on user_id by adding it to the sortkey or distkey.
A potential problem with adding user_id to the distkey is that, because of the severely uneven distribution in that column, certain nodes could take much longer and end up increasing the time taken by the query.
Your sort key criteria sound correct. Be aware that "start_time" and "end_time" in your query need to be literal date or timestamp values for the query optimizer to utilize the table metadata for initial filtering. Also, the table needs to be analyzed so that the metadata is valid.
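To make the point about literals concrete, here is a minimal sketch. The table and column names (user_metrics, event_date, metric) and the dates are hypothetical and stand in for whatever your schema actually uses:

```sql
-- Hypothetical names: user_metrics, event_date, metric.
-- Refresh the table statistics so the planner's metadata is current.
ANALYZE user_metrics;

-- The date bounds are written as literals rather than computed
-- expressions, so the range filter on the sort key column can be
-- applied up front.
SELECT user_id,
       SUM(metric) AS total_metric
FROM user_metrics
WHERE event_date >= '2023-01-01'
  AND event_date <  '2023-02-01'
GROUP BY user_id;
```

If the bounds come from application code, one option is to substitute them into the SQL text as literals before the statement is submitted.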
As for the distribution key, you can look for other columns that skew the table less than user_id does but still correlate well with user_id, so that you keep the performance benefit. If none exists you can make one - I've done this for clients a few times when it is important enough. An example of how this could play out:
That's it. Define a column that keeps each user_id value on a single slice and also keeps the table uniformly distributed.
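One way such a column might be built is sketched below. This is not the original example, only an illustration under stated assumptions: the names user_metrics, user_bucket, user_metrics_by_bucket, event_date, metric and dist_bucket are all hypothetical, and the bucket count of 64 is arbitrary.

```sql
-- Hypothetical names and bucket count throughout.
-- 1. Rank users by row count and deal them into 64 buckets, heaviest
--    first. Every row of a given user_id gets the same bucket value.
CREATE TABLE user_bucket AS
SELECT user_id,
       MOD(ROW_NUMBER() OVER (ORDER BY row_count DESC), 64) AS dist_bucket
FROM (SELECT user_id, COUNT(*) AS row_count
      FROM user_metrics
      GROUP BY user_id) AS per_user;

-- 2. Rebuild the table distributed on the bucket and sorted on the date.
CREATE TABLE user_metrics_by_bucket
DISTKEY (dist_bucket)
SORTKEY (event_date)
AS
SELECT m.user_id, m.event_date, m.metric, b.dist_bucket
FROM user_metrics m
JOIN user_bucket b ON b.user_id = m.user_id;

-- 3. Check the resulting row skew across slices.
SELECT "table", skew_rows
FROM svv_table_info
WHERE "table" = 'user_metrics_by_bucket';
```

Dealing users out by rank is only a rough balancing heuristic. Redshift hashes the distkey value to pick a slice, so several buckets can land on the same slice, and if a single user holds a large share of the rows, no placement that keeps that user on one slice can be fully uniform. The skew check in step 3 is there to tell you whether the bucket assignment needs adjusting.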