Put data from AWS Kinesis into different buckets based on data type

I've followed the setup described in this tutorial to configure a data pipeline from Aurora all the way to Redshift. I've got this working perfectly for one table, e.g. Sales.

However, now I want to expand things so that I can bring in data from other tables as well, e.g. Products and Categories, such that each data type ends up in a separate table in Redshift, i.e. Redshift should have a Sales table and a Products table in addition to a Categories table.

How do I do this with a Kinesis/S3/Redshift setup?

Redshift is able to bring data in from only one S3 location. Similarly, Kinesis can be configured to put data into only one S3 location. I'm trying to find a way to route my records from Kinesis based on data type so that they go into different S3 locations, which would let me pull them into separate Redshift tables.

The obvious solution is to have more than one stream, each corresponding to a data type, but I think this will be expensive. What options are there to do this?

Good news: in Kinesis Data Firehose you pay only for the amount of data your pipeline processes, plus the data conversions (if applicable). So you can have two separate streams, and it shouldn't be more expensive than a single one.

Regarding Redshift Spectrum, you can actually bring in data from as many locations as you need. If you look at the post you linked, there is a create table statement like this:

    CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_sales(
      ItemID int,
      Category varchar,
      Price DOUBLE PRECISION,
      Quantity int,
      OrderDate TIMESTAMP,
      DestinationState varchar,
      ShippingType varchar,
      Referral varchar)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    LOCATION 's3://{BUCKET_NAME}/CDC/'

In that statement, the last line references the S3 location of the files to include in the table. You would configure several streams, one per table/S3 location, but you can use a single Redshift cluster to query all your tables; a sketch of a second table definition follows below.
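
As a minimal sketch, assuming a second Firehose stream delivers Products records to a separate prefix such as s3://{BUCKET_NAME}/CDC-products/ (the prefix and the column names here are illustrative, not from the original post), a matching external table could look like this:

    -- Hypothetical: external table over a second Firehose output prefix
    CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_products(
      ProductID int,
      ProductName varchar,
      CategoryID int,
      UnitPrice DOUBLE PRECISION)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    LOCATION 's3://{BUCKET_NAME}/CDC-products/'

Since both external tables live in the same spectrum_schema, a single cluster can query or join them, assuming here that the sales ItemID corresponds to the products ProductID:

    -- Join sales records to product details across the two S3 locations
    SELECT s.OrderDate, p.ProductName, s.Quantity
    FROM spectrum_schema.ecommerce_sales s
    JOIN spectrum_schema.ecommerce_products p ON s.ItemID = p.ProductID;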
