
Is AWS Lambda preferred over AWS Glue Job?

Original post: 2020-08-26 14:29:53 | Tags: amazon-web-services, aws-lambda, aws-glue

In an AWS Glue job, we can write a script and execute it via the job.

In AWS Lambda too, we can write the same script and execute the same logic as in the job above.

So my question is not what the difference between an AWS Glue job and AWS Lambda is. I am trying to understand when an AWS Glue job should be preferred over AWS Lambda, especially when both do the same job. If both do the same job, then ideally I would blindly prefer AWS Lambda itself, right?

Please try to understand my question.

3 answers

The answer to this can involve some foundational design decisions. What is this job doing? What kind of data are you dealing with? Is there a decision to be made about whether the task should be executed in a batch or an event-oriented paradigm?

Batch

This may be necessary or desirable because the task:

Is being done over large monolithic data (e.g., binary).
Relies on the context of multiple records in a dataset, such that they must be loaded into a single job.
Depends on the order in which records are processed.

Just as often, I see batch handling chosen by default because "this is the way we've always done it", but breaking from that approach can be worth considering.

Glue is built for batch operations. With a current maximum execution time of 15 minutes and maximum memory of 10 GB, Lambda has also become capable of processing fairly large datasets in a single execution. It can be difficult to pin down a direct cost comparison without specifics of the workload. When it comes to development, I feel that Lambda has the edge in tooling to build, test, and deploy.

Event

In the case where your data consists of a set of records, it might behoove you to parse and "stream" them into Lambda. Consider a flow like:

CSV lands in S3.
S3 event triggers Lambda.
Lambda reads and parses the CSV into discrete events, then submits them to another Lambda or publishes them to SNS for downstream processing. Concurrent instances of this Lambda can be employed to speed up ingest, where each instance is responsible for certain lines of the S3 object.
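
The flow above could be sketched roughly as follows. This is a minimal illustration, not a production handler: the topic ARN, the event shape produced by csv_to_events, and the injectable s3/sns parameters are assumptions made for the example. The boto3 calls used (get_object, publish) are the standard S3 and SNS client methods.

```python
import csv
import io
import json
from urllib.parse import unquote_plus

# Hypothetical topic ARN, for illustration only.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:records"


def csv_to_events(csv_text, source_key):
    """Turn each CSV row into a small, self-describing event dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    events = []
    for row in reader:
        # reader.line_num is the line of the source file just consumed.
        events.append({"source": source_key, "line": reader.line_num, "record": row})
    return events


def handler(event, context, s3=None, sns=None):
    """S3-triggered entry point: read the object, publish one message per row."""
    if s3 is None or sns is None:
        import boto3  # imported lazily so the parsing logic is testable offline
        s3 = s3 or boto3.client("s3")
        sns = sns or boto3.client("sns")

    published = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for evt in csv_to_events(body, key):
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(evt))
            published += 1
    return {"published": published}
```

Reading the whole object into memory keeps the sketch short; for larger files, streaming the body line by line would be the more careful choice.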

This pushes all logic and error handling, as well as the resources required, down to the level of the individual event/record. Mechanisms such as dead-letter queues are often employed for remediation. While the context of a given container persists across invocations (assuming the container has not been idle and torn down), Lambda should generally be considered stateless, such that the processing of an event/record is thought of as occurring within its own scope, outside that of others in the dataset.
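
As one possible illustration of record-level error handling, the sketch below assumes the downstream Lambda is fed by an SQS queue with partial batch responses (ReportBatchItemFailures) enabled on the event source mapping, and a dead-letter queue configured on the queue itself. The answer does not say which mechanism is used, so both are assumptions; process_record is a hypothetical stand-in for the real per-record logic.

```python
import json


def process_record(record):
    """Hypothetical per-record business logic; raises on bad input."""
    if "id" not in record:
        raise ValueError("record has no id")
    return record["id"]


def handler(event, context):
    """Process each SQS message in its own scope and report only the failures."""
    failures = []
    for message in event.get("Records", []):
        try:
            process_record(json.loads(message["body"]))
        except Exception:
            # Only this message is retried; after the queue's configured
            # maxReceiveCount it would move to the dead-letter queue.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```

Each record succeeds or fails on its own, which is the stateless, per-record scope described above.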

Lambda has a maximum execution time of fifteen minutes. It can be used to trigger a Glue job as an event-based activity. That is, when a file lands in S3, for example, an event trigger can run a Glue job. Glue is a managed service for data processing.
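
A minimal sketch of that pattern, assuming an S3 event notification invokes the Lambda: the job name and the --source_path argument are hypothetical, and the glue parameter exists only so the function can be exercised without AWS credentials. start_job_run is the boto3 Glue client call for starting a job run.

```python
from urllib.parse import unquote_plus

# Hypothetical Glue job name, for illustration only.
GLUE_JOB_NAME = "my-etl-job"


def handler(event, context, glue=None):
    """Start one Glue job run for each S3 object named in the event."""
    if glue is None:
        import boto3  # imported lazily so the function is testable offline
        glue = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Job arguments are passed as "--name" keys; the Glue script
            # would read "--source_path" itself (an assumed parameter).
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return {"job_run_ids": run_ids}
```

The Lambda returns as soon as the run is started, so the heavy processing is not bound by Lambda's fifteen-minute limit.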

If the data volume is very small, you may be able to do the work in Lambda, but if for some reason the process runs beyond fifteen minutes, the data processing will fail.
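
One way to avoid failing abruptly at the timeout is to check the remaining time and stop early. The sketch below is an illustration under assumptions: process_with_deadline, the work callback, and the 30-second safety margin are hypothetical. context.get_remaining_time_in_millis() is the method the Lambda Python runtime provides on the context object.

```python
def process_with_deadline(items, context, work, safety_margin_ms=30_000):
    """Process items until the Lambda is close to timing out, then stop cleanly."""
    done = 0
    for item in items:
        # Stop before the runtime kills the invocation mid-record.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break
        work(item)
        done += 1
    return {"processed": done, "remaining": len(items) - done}
```

The caller can then hand the unprocessed remainder to another invocation, or treat a regularly non-zero "remaining" as a sign that the workload belongs in Glue.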

Additional points:

Per this source, the Lambda FAQ, and the Glue FAQ:

Lambda can use a number of different languages (Node.js, Python, Go, Java, etc.), whereas Glue can only execute jobs written in Scala or Python.

Lambda can execute code in response to triggers from other services (SQS, Kafka, DynamoDB, Kinesis, CloudWatch, etc.), whereas Glue jobs can be triggered by Lambda events, by other Glue jobs, manually, or on a schedule.

Lambda runs much faster for smaller tasks, whereas Glue jobs take longer to initialize because they use distributed processing. That being said, Glue leverages its parallel processing to run large workloads faster than Lambda.

Lambda looks to require more complexity/code to integrate with data sources (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.), while Glue can easily integrate with these. However, with the addition of Step Functions, multiple Lambda functions can be written and ordered sequentially to reduce complexity and improve modularity, where each function can integrate with one AWS service (Redshift, RDS, S3, DBs running on ECS instances, DynamoDB, etc.).

Glue looks to have a number of additional components, such as the Data Catalog, which is a central metadata repository for viewing your data; a flexible scheduler that handles dependency resolution, job monitoring, and retries; AWS Glue DataBrew for cleaning and normalizing data with a visual interface; AWS Glue Elastic Views for combining and replicating data across multiple data stores; and AWS Glue Schema Registry for validating streaming data schemas.

There are other examples I am missing, so feel free to comment and I can update.