AWS Glue vs EMR Serverless

Asked 2021-12-12 08:10:25. Tags: amazon-web-services, amazon-emr, aws-glue, emr-serverless

Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/ - a new and very promising service.

As I understand it, AWS Glue is a managed service on top of Apache Spark (for the transformation layer). AWS EMR is also mostly used for Apache Spark. So EMR Serverless (for Apache Spark) looks very similar to AWS Glue.

Right now I have one question on my mind: what is the core difference from AWS Glue, and when should I choose EMR Serverless over Glue?

Could EMR Serverless even become part of the AWS Glue ecosystem as its transformation layer? Maybe AWS is going to replace the transformation layer in AWS Glue with EMR Serverless, which would make sense: AWS Glue would then play the role of ETL overlay and metastore, with EMR Serverless as the processing layer.

2 Answers

I'll give you my two cents on this, because I've been wondering the same thing.

Glue

As per the AWS documentation, AWS Glue is "Simple, scalable, and serverless data integration". Glue can be used for a variety of things: as a metadata repository, for automatic schema discovery, for code generation, and to run ETL pipelines that prepare data. Glue is a serverless service: it provisions and manages the compute resources needed to run your data pipelines, so you don't need to create and manage the infrastructure yourself.

If we focus only on the processing feature and set aside the Glue-specific features (schema discovery, code generation, etc.), then EMR Serverless and Glue look almost identical. One of the key advantages of both services is the ability to run Spark or Hive serverless applications.

What advantage will EMR Serverless have over Glue Spark jobs?

To run a Glue job, you must specify either MaxCapacity (for Glue version 1.0 or earlier jobs) or Worker type and Number of workers (for Glue version 2.0 or later jobs). Both options assume, first, that you have some understanding of the data and the workload per cluster, and second, that the workload during job execution will be uniform, i.e., that the provisioned resources will be neither over- nor under-utilized.
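As a rough illustration of that choice, here is a minimal sketch in Python. The helper `glue_job_capacity_args` is hypothetical (it is not part of any AWS SDK); it only builds the capacity-related arguments, using the parameter names `GlueVersion`, `MaxCapacity`, `WorkerType` and `NumberOfWorkers` from the Glue job API. The worker type `G.1X` and the counts are example values, not recommendations.

```python
from typing import Any, Dict, Optional


def glue_job_capacity_args(
    glue_version: str,
    max_capacity: Optional[float] = None,
    worker_type: Optional[str] = None,
    number_of_workers: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the capacity part of a Glue job definition (hypothetical helper)."""
    major = int(glue_version.split(".")[0])
    uses_workers = worker_type is not None and number_of_workers is not None

    # The two sizing styles are mutually exclusive.
    if max_capacity is not None and (
        worker_type is not None or number_of_workers is not None
    ):
        raise ValueError("Set MaxCapacity or WorkerType/NumberOfWorkers, not both")

    # Glue 2.0 and later jobs are sized by worker type and worker count.
    if major >= 2 and not uses_workers:
        raise ValueError("Glue 2.0+ jobs need WorkerType and NumberOfWorkers")

    if uses_workers:
        return {
            "GlueVersion": glue_version,
            "WorkerType": worker_type,
            "NumberOfWorkers": number_of_workers,
        }
    if max_capacity is not None:
        return {"GlueVersion": glue_version, "MaxCapacity": max_capacity}
    raise ValueError("No capacity setting given")


# Example: a fixed-size Glue 3.0 job with ten G.1X workers.
capacity = glue_job_capacity_args("3.0", worker_type="G.1X", number_of_workers=10)
print(capacity)
```

Either way the size is fixed up front for the whole run, which is the point made above: the resulting dictionary would be passed along with the job name, role and command when the job is created, and it does not change while the job executes.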

EMR Serverless

EMR Serverless is a new deployment option for AWS EMR. With EMR Serverless, you don't need to configure, optimize, secure, or manage clusters to run applications on these frameworks. EMR Serverless helps you avoid over- or under-allocating resources for a job, down to the level of individual stages.

EMR Serverless automatically identifies the resources needed by jobs, provisions those resources to run the jobs, and releases them when the jobs are completed. In cases where applications require a response within seconds, such as interactive data analysis, the engineer can pre-initialize the necessary resources during application creation. This provides easy initialization, fast job startup, automatic capacity management, and simple cost control.
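To make the pre-initialization option concrete, here is a minimal sketch of what an application request could look like. The helper names are hypothetical, and the field names (`initialCapacity`, `workerCount`, `workerConfiguration`, `autoStopConfiguration`) follow the EMR Serverless API as I understand it from the later general-availability release; the preview discussed here may differ. The release label and worker sizes are assumed example values.

```python
from typing import Any, Dict


def worker_spec(count: int, cpu: str, memory: str) -> Dict[str, Any]:
    """Describe one group of pre-initialized workers (hypothetical helper)."""
    return {
        "workerCount": count,
        "workerConfiguration": {"cpu": cpu, "memory": memory},
    }


def emr_serverless_app_request(
    name: str,
    release_label: str,
    pre_initialize: bool,
    idle_timeout_minutes: int = 15,
) -> Dict[str, Any]:
    """Build a request body for creating a Spark application (a sketch)."""
    request: Dict[str, Any] = {
        "name": name,
        "releaseLabel": release_label,
        "type": "SPARK",
        # Release resources after the application has been idle for a while.
        "autoStopConfiguration": {
            "enabled": True,
            "idleTimeoutMinutes": idle_timeout_minutes,
        },
    }
    if pre_initialize:
        # Warm workers kept ready so that jobs start within seconds.
        request["initialCapacity"] = {
            "DRIVER": worker_spec(1, "2vCPU", "4GB"),
            "EXECUTOR": worker_spec(4, "4vCPU", "8GB"),
        }
    return request


# Assumed example values: the application name and release label are placeholders.
interactive = emr_serverless_app_request("interactive-sql", "emr-6.6.0", True)
batch = emr_serverless_app_request("nightly-batch", "emr-6.6.0", False)
print(sorted(interactive))
print(sorted(batch))
```

The trade-off is the one described above: the batch application has no warm workers and relies entirely on automatic provisioning, while the interactive one keeps a small pool ready in exchange for faster startup.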

More info: https://luminousmen.com/post/emr-serverless-a-400level-guide

AWS Glue is a data integration and ETL service. It is a completely different kind of service from EMR, which is aimed at analytics.

AWS Glue can be used as a metadata store (table schemas) for EMR, and it can run integration jobs to prepare data (e.g. for EMR). What it offers are data integration jobs and workflows. The intention, at least, is to keep those jobs limited in scope but simpler to manage.
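For the metadata-store role, a minimal sketch of the configuration that points Spark on an EMR cluster at the Glue Data Catalog is shown below. The `spark-hive-site` classification and the factory class name are the ones I know from the EMR documentation for clusters; the function name is hypothetical, and EMR Serverless may handle the catalog differently.

```python
import json
from typing import Any, Dict, List

# Hive client factory that backs the metastore with the Glue Data Catalog.
GLUE_CATALOG_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations() -> List[Dict[str, Any]]:
    """Return EMR cluster configurations that use Glue as the Spark metastore."""
    return [
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_CATALOG_FACTORY,
            },
        }
    ]


# The JSON form is what a cluster definition would carry in its configurations.
print(json.dumps(glue_catalog_configurations(), indent=2))
```

With this in place, tables defined in Glue (for example by a crawler) would be visible to Spark SQL on the cluster without running a separate Hive metastore.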

EMR is much more (and very different). In theory, EMR could also run the Python data integration jobs in batch on top of a Spark cluster, but you could run any job inside a Spark cluster. EMR is more of an analytics and processing tool. It is not limited to Spark processing of Python batch jobs; you can use different frameworks. Though the EMR Serverless docs mention only Spark and Hive queries, you have much more control over the processing job.

If anything compares to the EMR service, it's Athena, which is something like a serverless EMR with Spark and Presto, running on its own network.