
Selecting sortkeys and distkeys for an AWS Redshift table with WHERE and GROUP BY

Given queries like this:

```sql
select
  user_id, aggregate metrics
from
  table
where
  date < end_time and date >= start_time
group by
  user_id
```

What should my sortkey and distkey be?

Based on articles I have read online, a sort key on date makes the most sense, since we need to filter out irrelevant data. But I'm not sure how, or whether, I can optimize the grouping on user_id by adding it to the sortkey or distkey.

A potential problem with adding user_id to the distkey is that, because of the severely uneven distribution in that column, certain nodes could take much longer and end up increasing the time taken by the query.
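One way to quantify that skew concern before committing to a distkey is to look at per-value row counts, and at Redshift's own skew statistic once a distkey is in place. This is a sketch, not part of the original question; the table name `sales` is a placeholder:

```sql
-- Rows per candidate distkey value: a large max-to-average ratio
-- means the heaviest user_id values would overload single slices.
select user_id, count(*) as rows_per_user
from sales
group by user_id
order by rows_per_user desc
limit 20;

-- For an existing table, SVV_TABLE_INFO reports the realized skew:
-- skew_rows is the ratio of rows on the fullest slice to the emptiest.
select "table", diststyle, skew_rows
from svv_table_info
where "table" = 'sales';
```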

Your sort key criteria sound correct. Be aware that "start_time" and "end_time" in your query need to be literal date or timestamp values for the query optimizer to use the table metadata for initial filtering. The table also needs to be analyzed so that the metadata is valid.
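A minimal sketch of what that looks like in practice (table and column names are illustrative): with date as the sort key, literal bounds let Redshift compare the predicate against each block's min/max metadata (zone maps) and skip blocks entirely, and ANALYZE keeps those statistics current.

```sql
-- Literal timestamp bounds: the planner can prune whole blocks whose
-- min/max date range falls outside the filter, before reading any rows.
select user_id, sum(metric) as total_metric
from sales
where date >= '2023-01-01' and date < '2023-02-01'
group by user_id;

-- Refresh table statistics so the metadata used for pruning is valid.
analyze sales;
```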

As for the distribution key, you can look for other columns that make better dist keys in terms of table skew but still correlate well with user_id, so they provide the same performance benefit. If none exist, you can make one; I've done this for clients a few times when it was important enough. An example of how this could play out:

  1. Create a new column in your table; let's call it __user_id_percentile (I like starting "artificial" columns with a double underbar to keep them distinguished from true data columns).
  2. Populate this column with values 1-100 (or 1-10,000 if you have a large cluster) such that each user_id value corresponds to only one __user_id_percentile value AND the number of rows per __user_id_percentile is approximately equal.
  3. Make __user_id_percentile the dist key of the table. This gives a balanced distribution of the table, with each value of user_id existing on only a single slice.
  4. Add __user_id_percentile to your group by list: "group by __user_id_percentile, user_id". You don't need to include this new column in the select list, and it won't affect the output of the query as long as no user_id exists in 2 or more __user_id_percentiles.
  5. You will likely want to keep the user_id-to-__user_id_percentile mapping in a table so that your ETL processes can populate this value quickly. You may also need to update this mapping if new data starts to skew the distribution, but this is a fairly simple update (and vacuum) process that only needs to run rarely.
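The steps above could be sketched as follows. This is one possible implementation, not the answerer's exact method: the table names `sales` and `user_bucket_map` are placeholders, and the round-robin bucketing is a simple heuristic (a greedy bin-packing over row counts would balance even better).

```sql
-- Step 5 first: build the persistent user_id -> bucket mapping table.
-- Assigning users round-robin in descending order of their row counts
-- spreads the heaviest users across buckets, keeping rows per bucket
-- approximately equal while each user_id maps to exactly one bucket.
create table user_bucket_map as
select user_id,
       1 + mod(row_number() over (order by row_count desc), 100)
         as __user_id_percentile
from (select user_id, count(*) as row_count
      from sales
      group by user_id);

-- Steps 1-2: add the artificial column and populate it from the mapping.
alter table sales add column __user_id_percentile int;

update sales
set __user_id_percentile = m.__user_id_percentile
from user_bucket_map m
where sales.user_id = m.user_id;

-- Step 3: make it the dist key. Recent Redshift supports this in place;
-- on older clusters you would deep-copy via CREATE TABLE AS instead.
alter table sales alter distkey __user_id_percentile;
```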

That's it. Define a column that keeps each user_id value on a single slice while also keeping the table uniformly distributed.
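With the column in place, the original query changes only in its group by clause (step 4 above); the select list, and therefore the output, stay the same, but each group can now be aggregated entirely on one slice. Names are again illustrative:

```sql
select user_id, sum(metric) as total_metric
from sales
where date >= '2023-01-01' and date < '2023-02-01'
group by __user_id_percentile, user_id;
```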
