
Redshift performance: SQL queries vs table normalization

I'm building a Redshift database by listening to events from different sources and pumping that data into a Redshift cluster.

The idea is to use Kinesis Firehose to pump data into Redshift using the COPY command. But I have a dilemma here: I first wish to query some information from Redshift using a SELECT query such as the one below:

select A, B, C from redshift__table where D='x' and E = 'y';

After getting the required information from Redshift, I will combine it with my event notification data and issue a request to Kinesis. Kinesis will then do its job and issue the required COPY command.
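For context, the COPY statement that Firehose issues against the cluster typically takes the following shape. The bucket, prefix, and IAM role ARN below are placeholders, not values from this setup:

```sql
-- Illustrative only: Firehose stages records in S3, then runs a COPY
-- like this against the target table. All identifiers are placeholders.
COPY redshift__table
FROM 's3://my-firehose-bucket/some-prefix/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/firehose-redshift-role'
JSON 'auto';
```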

Now my question is: is it a good idea to query Redshift repeatedly, say every second, since that is the expected interval at which I will receive event notifications?

Now let me describe an alternate scenario:

If I normalize my table and separate some fields into a second table, then the normalized design will let me query Redshift less often (perhaps once every 30 seconds).

But the downside of this approach is that once the data is in Redshift, I will have to perform table joins when running real-time analytics on it.

So, at a high level, I wish to know which approach would be better:

  1. Keep a single flat table, but query it before issuing a request to Kinesis on each event notification. There won't be any table joins while performing analytics.

  2. Keep two tables and query Redshift less often, but perform a table join when displaying results in BI/analytical tools.

Which of these two do you think is the better option? Let us assume that I will use appropriate sort keys/distribution keys in either case.

I'd definitely go with your second option, which involves JOINing when you query. That's what Amazon Redshift is good at (especially if you have your SORTKEY and DISTKEY set correctly).
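As a sketch of what that could look like for the two-table design (table and column names here are illustrative assumptions, not taken from the question): distributing both tables on the join column co-locates matching rows on the same node, so the join needs no network redistribution, and sorting the fact table on the timestamp keeps time-range scans cheap.

```sql
-- Illustrative normalized design; all names are assumptions.
CREATE TABLE events (
    event_id   BIGINT,
    d_key      VARCHAR(32),   -- join column, also the distribution key
    e_key      VARCHAR(32),
    event_time TIMESTAMP
)
DISTKEY (d_key)
SORTKEY (event_time);

CREATE TABLE dimensions (
    d_key VARCHAR(32),
    a     VARCHAR(64),
    b     VARCHAR(64),
    c     VARCHAR(64)
)
DISTKEY (d_key)
SORTKEY (d_key);
```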

Let the streaming data come into Redshift in the most efficient manner possible, then join when running queries. You'll issue far fewer queries that way.
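An analytics query against such a normalized pair of tables (again with illustrative names, assuming an `events` fact table and a `dimensions` lookup table sharing a `d_key` column) would then join at read time:

```sql
-- Join at query time; with both tables distributed on d_key the join is
-- node-local, and a SORTKEY on event_time makes the range filter cheap.
SELECT e.event_time, d.a, d.b, d.c
FROM events e
JOIN dimensions d ON d.d_key = e.d_key
WHERE e.event_time >= DATEADD(hour, -1, GETDATE());
```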

Alternatively, you could run a regular job (e.g. hourly) to batch-process the data into a wide table. It depends how quickly you need to query the data after loading.

