Apache Spark AWS Glue job versus Spark on Hadoop cluster for transferring data between buckets

Original post: 2022-10-09 09:18:41 · Tags: apache-spark, amazon-s3, aws-glue

Let's say I need to transfer data between two S3 buckets as an ETL job, applying a simple transformation along the way (selecting a subset of the columns and filtering by ID). The data consists of Parquet files, and its size varies between 1 GB and 100 GB.

Which would be more efficient in terms of speed and cost: an Apache Spark Glue job, or Spark on a Hadoop cluster with X machines?
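For reference, the transformation described in the question (filter by ID, keep a subset of columns, Parquet in and out) might look roughly like this in PySpark. This is only a sketch: the bucket paths, column names, and ID values are hypothetical placeholders, and the same script shape should run on either Glue or a Spark cluster.

```python
def select_and_filter(df, columns, id_column, wanted_ids):
    """Keep rows whose ID is in wanted_ids, then keep only the listed columns."""
    # Filter first so id_column does not have to be one of the output columns.
    filtered = df.where(df[id_column].isin(list(wanted_ids)))
    return filtered.select(*columns)


def run_job(source, target, columns, id_column, wanted_ids):
    """Read Parquet from source, transform it, and write Parquet to target."""
    # Imported here so the helper above can be loaded without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bucket-to-bucket-etl").getOrCreate()
    df = spark.read.parquet(source)
    result = select_and_filter(df, columns, id_column, wanted_ids)
    result.write.mode("overwrite").parquet(target)
    spark.stop()


# Example call with hypothetical paths and columns (not from the question):
# run_job("s3://source-bucket/input/", "s3://target-bucket/output/",
#         ["id", "name", "amount"], "id", [101, 102, 103])
```

On a self-managed Hadoop cluster the paths would typically use the `s3a://` scheme instead of `s3://`.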

2 Answers

The answer is basically the same for any pair of serverless (Glue) and non-serverless (EMR) service equivalents.

The first should be faster to set up, but will be less configurable and probably more expensive. The second gives you more options for optimization (performance and cost), but don't forget to include the cost of managing the service yourself. You can use the AWS Pricing Calculator if you need a price estimate upfront.
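As a rough illustration of the comparison, a back-of-the-envelope estimate could be sketched as below. Every number here is a placeholder assumption rather than a quoted AWS price, and the model is deliberately simplified (it ignores EBS, data transfer, and billing minimums); the AWS Pricing Calculator remains the better source for real figures.

```python
def glue_run_cost(dpus, hours, dpu_hour_price):
    """Cost of one Glue run: DPUs x hours x price per DPU-hour."""
    return dpus * hours * dpu_hour_price


def emr_run_cost(nodes, hours, ec2_hour_price, emr_hour_fee, ops_cost_per_run=0.0):
    """Cost of one EMR run: per-node EC2 price plus EMR fee, plus the
    cost of managing the cluster yourself (ops_cost_per_run)."""
    return nodes * hours * (ec2_hour_price + emr_hour_fee) + ops_cost_per_run


# Placeholder numbers for illustration only; look up real prices for your region.
glue = glue_run_cost(dpus=10, hours=1.0, dpu_hour_price=0.44)
emr = emr_run_cost(nodes=5, hours=1.0, ec2_hour_price=0.20,
                   emr_hour_fee=0.05, ops_cost_per_run=1.0)
print(f"Glue: ${glue:.2f} per run, EMR: ${emr:.2f} per run")
```

The `ops_cost_per_run` term is there to make the self-management overhead explicit, since it is easy to leave out of the comparison.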

I would definitely start with Glue and move to something more complicated if problems arise. Also, don't forget that EMR Serverless is now available as well.

I read this question when deciding whether it was worthwhile to switch from AWS Glue to AWS EMR.

With configurable EC2 Spot Instances on EMR, we drastically reduced the cost of a previous Glue job that read 1 GB to 4 TB of uncompressed CSV data. We were able to use Spot Instances to get much larger and faster Graviton-based EC2 instances that could load more data into RAM, reducing spills to disk. Another benefit was that we got rid of DynamicFrames, which are very useful when you do not know the schema but were overhead that we did not need. In addition, the Spot Instances, which are larger than what AWS Glue provides, reduced our run time, though not by much. More importantly, we cut our costs by 40-75%, and yes, that is even with the EC2 + EBS + EMR overhead cost per EC2 instance. We went from $25-$250 a day on Glue to $2-$60 a day on EMR. Monthly costs for this process were $1,600 on AWS Glue and are now under $500. We run EMR as job_flow_run and terminate when idle, so that it essentially behaves like serverless Glue.
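A setup along these lines could be sketched with boto3, assuming that `job_flow_run` above refers to the EMR `run_job_flow` API. The cluster name, release label, instance types and counts, role names, and S3 URIs are all illustrative assumptions, not the configuration actually used in this answer.

```python
def build_job_flow_request(script_uri, log_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "bucket-to-bucket-etl",       # hypothetical cluster name
        "ReleaseLabel": "emr-6.9.0",          # assumed release
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "primary",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m6g.xlarge",   # Graviton, assumed size
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "SPOT",               # Spot capacity for workers
                    "InstanceType": "r6g.2xlarge",  # Graviton, assumed size
                    "InstanceCount": 4,
                },
            ],
            # Shut the cluster down once all steps finish, so nothing idles.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names, assumed to exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-east-1"):
    """Submit the request and return the new cluster's ID."""
    import boto3  # imported here so the builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster terminate after its last step, which is the behavior that approximates a serverless job.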

We did not go with EMR Serverless because it did not offer Spot Instances, which were probably the biggest benefit.

The only problem is that we did not switch earlier. We are now moving all AWS Glue jobs to AWS EMR.