ETL architecture with AWS Glue and Data Pipeline

I'm trying to decide whether to use AWS Glue or AWS Data Pipeline for our ETL. I need to incrementally copy several tables to Redshift. Almost all of the tables need to be copied with no transformation. One table requires a transformation that could be done using Spark.

Based on my understanding of these two services, the best solution is to use a combination of the two. Data Pipeline can copy everything to S3. From there, if no transformation is needed, Data Pipeline can use Redshift COPY to move the data to Redshift. Where a transformation is required, a Glue job can apply the transformation and copy the data to Redshift.

Is this a sensible strategy or am I misunderstanding the applications of these services?
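To make the proposed split concrete, here is a minimal sketch of the routing decision and of the COPY statement that would load untransformed files from S3. The function names, table name, bucket, and IAM role ARN are all made up for illustration; only the general shape of the Redshift COPY command is taken from the plan above.

```python
def plan_load(needs_transform):
    """Pick the load path from the question: Glue for transformed tables, COPY otherwise."""
    return "glue_job" if needs_transform else "redshift_copy"


def build_copy_statement(table, s3_prefix, iam_role_arn, file_format="CSV"):
    """Build a Redshift COPY statement for files staged under an S3 prefix.

    All argument values are supplied by the caller; nothing here talks to AWS.
    """
    allowed_formats = {"CSV", "PARQUET"}
    if file_format not in allowed_formats:
        raise ValueError(f"unsupported format: {file_format}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


if __name__ == "__main__":
    # Hypothetical table, bucket, and role, for illustration only.
    print(plan_load(needs_transform=False))
    print(
        build_copy_statement(
            "public.orders",
            "s3://example-bucket/staging/orders/",
            "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
        )
    )
```

The generated statement would still have to be executed against the cluster, for example by a Data Pipeline activity or any SQL client.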

I'm guessing this is long past the project deadline, but for people looking at this:

Use only AWS Glue. You can define Redshift as both a source and a target connector, meaning that you can read from it and write into it. Before you do that, however, you'll need to use a Crawler to create the Glue-specific schema.
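As a rough sketch of what such a Glue job might look like: it reads a table that the Crawler registered in the Data Catalog and writes it to Redshift through a Glue connection. The catalog database, table, connection name, and target table below are hypothetical placeholders, and the job is assumed to have a temporary directory configured so that `TempDir` is passed in.

```python
# Hypothetical names; substitute whatever your Crawler and connection are called.
CATALOG_DATABASE = "example_catalog_db"
CATALOG_TABLE = "example_source_table"
REDSHIFT_CONNECTION = "example-redshift-connection"
TARGET_TABLE = "public.example_target"
TARGET_DATABASE = "dev"


def redshift_target_options(table, database):
    """Connection options for writing through a Glue JDBC connection."""
    return {"dbtable": table, "database": database}


def run_job(argv):
    # Imported here because awsglue and pyspark only exist in the Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME", "TempDir"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table definition the Crawler created in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=CATALOG_DATABASE, table_name=CATALOG_TABLE
    )

    # Any Spark transformation for the one special table would go here;
    # this sketch passes the data through unchanged.

    # Write to Redshift, staging the data in the job's temporary S3 directory.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=source,
        catalog_connection=REDSHIFT_CONNECTION,
        connection_options=redshift_target_options(TARGET_TABLE, TARGET_DATABASE),
        redshift_tmp_dir=args["TempDir"],
    )
    job.commit()

# Inside an actual Glue job script, finish with: import sys; run_job(sys.argv)
```

This only runs inside the Glue environment; it does not handle the incremental part, which would need job bookmarks or a filter on a timestamp column.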

All of this can also be done with Data Pipeline alone, using SqlActivity objects, although setting everything up might take significantly longer and would not be much cheaper.

Rant: I'm honestly surprised that AWS has focused solely on big data solutions without providing a decent tool for small, medium, or large data sets. Glue is overkill and Data Pipeline is cumbersome and terrible to use. There should be a simple SQL-type Lambda!
