
AWS Glue as an ETL tool?

Original post: 2020-06-30 11:17:18 · Tags: amazon-web-services, apache-nifi, aws-glue

Why does AWS call Glue an ETL tool? We have to code everything ourselves to pull data; Glue provides no built-in functionality for it. Are there any benefits to using Glue instead of NiFi or some other ingestion tool?

1 answer

Glue is a good ETL tool within AWS, especially for big data workloads. After all, it runs on Spark.

Glue can generate some basic automated transformation code, for example moving data from A to B and remapping column names.
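To make "remap column names" concrete, here is a minimal plain-Python sketch of the idea. It is not Glue code and does not call the Glue API: it only borrows the 4-tuple shape (source column, source type, target column, target type) that Glue's ApplyMapping transform uses. The `apply_mapping` function, the `CASTERS` table, and the sample column names are all made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# (source column, source type, target column, target type).
# Same tuple shape as Glue's ApplyMapping; everything else here is hypothetical.
Mapping = Tuple[str, str, str, str]

# Illustrative casters for this sketch only.
CASTERS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "int": int,
    "double": float,
}


def apply_mapping(
    rows: Iterable[Dict[str, Any]], mappings: List[Mapping]
) -> List[Dict[str, Any]]:
    """Rename and cast columns. In this sketch, unmapped columns are dropped."""
    out: List[Dict[str, Any]] = []
    for row in rows:
        new_row: Dict[str, Any] = {}
        for src, _src_type, dst, dst_type in mappings:
            value = row.get(src)
            # Keep missing/None values as None instead of casting them.
            new_row[dst] = None if value is None else CASTERS[dst_type](value)
        out.append(new_row)
    return out


if __name__ == "__main__":
    source_rows = [{"cust_id": "42", "nm": "Ada", "unused": 1}]
    mappings = [
        ("cust_id", "string", "customer_id", "int"),
        ("nm", "string", "name", "string"),
    ]
    print(apply_mapping(source_rows, mappings))
```

In a real Glue job the generated script would do the equivalent on a distributed DynamicFrame rather than on a Python list.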

However, it's the flexibility to write custom code that really sets it apart. Using the Glue code editor or the PyCharm IDE, you can script any transformations you need in PySpark and/or Scala.

The benefits of Glue really show when it is used in conjunction with other AWS services. The Glue Data Catalog is shared with Athena and even AWS EMR, so you end up with a central point for your big data ecosystem.

One limitation of Glue I have found is writing large datasets (10 million+ rows) to MS SQL Server. Glue uses JDBC drivers, and as of 2020 there is not yet a Microsoft JDBC connection that takes advantage of bulk copy. So you are effectively issuing an insert statement for each row, and performance can suffer once you get into the tens of millions of rows.
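To illustrate why one insert statement per row hurts, here is a rough sketch that counts the statements issued by a row-by-row load versus a batched load. It uses Python's built-in `sqlite3` purely as a stand-in: it is not SQL Server, not JDBC, and not Glue, and the table name `target`, the function names, and the batch size are assumptions made for the example.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str]


def insert_row_by_row(conn: sqlite3.Connection, rows: List[Row]) -> int:
    """Execute one INSERT per row; return the number of statements issued."""
    statements = 0
    for row in rows:
        conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", row)
        statements += 1
    conn.commit()
    return statements


def insert_in_batches(
    conn: sqlite3.Connection, rows: List[Row], batch_size: int = 100
) -> int:
    """Execute one multi-row INSERT per batch; return the statements issued."""
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Build "(?, ?), (?, ?), ..." with one placeholder group per row.
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(
            f"INSERT INTO target (id, name) VALUES {placeholders}", params
        )
        statements += 1
    conn.commit()
    return statements


if __name__ == "__main__":
    rows = [(i, f"name-{i}") for i in range(1000)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")

    print("row-by-row statements:", insert_row_by_row(conn, rows))
    conn.execute("DELETE FROM target")
    print("batched statements:", insert_in_batches(conn, rows, batch_size=100))
```

Against a remote database each statement also pays a network round trip, which is where the per-row pattern becomes costly at tens of millions of rows.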