
How to optimize Spark for writing large amounts of data to S3

I do a fair amount of ETL using Apache Spark on EMR.

I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.

Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - adding a few columns, and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.

I run it like this:

spark-submit --conf spark.dynamicAllocation.enabled=true  --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf  spark.executor.memoryOverhead=5120 --conf  spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>

The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
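For illustration only, the kind of sizing arithmetic meant here looks roughly like this; every constant below is a placeholder assumption rather than the job's real values:

// Hypothetical sizing helper - all numbers are example assumptions, not the actual job's values.
val coreNodes        = 100                              // worker nodes provisioned for this input size
val coresPerNode     = 36
val executorCores    = 3
val executorsPerNode = coresPerNode / executorCores - 1 // leave some headroom for YARN/OS daemons
val numExecutors     = coreNodes * executorsPerNode     // feeds --num-executors
val parallelism      = numExecutors * executorCores * 2 // feeds spark.sql.shuffle.partitions and spark.default.parallelism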

The code roughly does this:代码大致是这样的:

val df = ... // read from S3 and add a few columns like timestamp and source file name

val dfPartitioned = df.coalesce(numPartitions)

val sqlDFProdDedup = spark.sql(s""" (query to dedup against prod data """);

sqlDFProdDedup.repartition($"partition_column")
  .write.partitionBy("partition_column")
  .mode(SaveMode.Append).parquet(outputPath)

When I look at the Ganglia chart, I see a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data uses only a tiny fraction of the resources and runs for several hours.

I don't think the primary issue is partition skew, because the data should be fairly evenly distributed across all the partitions.

The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.

I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
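For reference, a minimal sketch of how that setting is applied in code; the value shown is an arbitrary placeholder, not the one this job uses:

// Cap the number of records Spark writes into any single output file (placeholder value).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 20000000L)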

So, my big question is: how can I improve the performance here?

Simply adding resources doesn't seem to help much.

I've tried making the executors larger (to reduce shuffling) and also increasing the number of CPUs per executor, but that doesn't seem to matter.

Thanks in advance!

Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are a couple of my observations:

  1. Not sure what the coalesce(numPartitions) number actually is and why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions and that's good enough to start with.

  2. If you are going to repartition before the write, then the coalesce above may not be beneficial at all, as the repartition will shuffle the data anyway.

  3. Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the write in the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, each parquet file ranges from roughly 5 GB to 10 GB, which is really huge; I doubt anyone will be able to open them on a local laptop or EC2 machine unless they use EMR or similar (with huge executor memory, whether reading the whole file or spilling to disk), because the memory requirement will be too high. I would recommend creating parquet files of around 1 GB to avoid any of those issues.

Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can actually run an experiment by simply writing the dataframe with the default partitions.

Which brings me to the point that you really don't need the repartition, since you already have the write.partitionBy("partition_date") call. Your repartition() call is actually forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, and that is what is limiting the number of files being written. The write.partitionBy("partition_date") is what actually lays the data out into S3 partitions, and if your dataframe has, say, 90 partitions it will write roughly 3 times faster (3 * 30). df.repartition() is forcing it to slow down. Do you really need to have 5 GB or larger files?
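To make that concrete, here is a hedged sketch of the contrast being described, reusing the names from the question's snippet:

// The column-only repartition caps the number of write tasks at the number of distinct dates (~30):
//   sqlDFProdDedup.repartition($"partition_column").write.partitionBy("partition_column")...
// Dropping it and letting partitionBy alone drive the S3 layout keeps far more tasks writing in parallel:
sqlDFProdDedup
  .write.partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)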

  4. Another major point is that Spark's lazy evaluation is sometimes too smart. In your case it will most likely only use a number of executors for the whole program based on the repartition(number). Instead you should try df.cache() -> df.count() and then df.write(); see the sketch after these points. What this does is force Spark to use all the available executor cores. I am assuming you are reading the files in parallel; in your current implementation you are likely using 20-30 cores. One point of caution: as you are using r4/r5 machines, feel free to up your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my tasks than the standard 5-core recommendation.

  5. Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading 1000s of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try; an example of the relevant settings also follows these points.
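A minimal sketch of the cache -> count -> write pattern from point 4, reusing the dataframe and path names from the question's snippet (SaveMode comes from org.apache.spark.sql):

// Materialize the dataframe across all executor cores before the write.
val cached = sqlDFProdDedup.cache()
cached.count()                        // action that forces the cache to be built in parallel
cached.write
  .partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)
cached.unpersist()                    // release the cached blocks once the write completes

And for point 5, the collector switch is usually passed through the extra Java options; the exact flags depend on your JVM and the rest of your GC tuning:

--conf spark.executor.extraJavaOptions=-XX:+UseParallelGC --conf spark.driver.extraJavaOptions=-XX:+UseParallelGC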

In my workload, I use a coalesce(n)-based approach, where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.

I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found that 1 GB files seem acceptable with pandas, Redshift Spectrum, Athena, etc.

Hope it helps. Charu

Here are some optimizations to make this run faster.

(1) File committer - this is how Spark writes the part files out to the S3 bucket. Each operation is distinct and will be based upon

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2


This will write the output directly to the final part files, instead of initially writing it to temp files and copying those over to their end-state part files.
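For example, this can be passed on the spark-submit command line alongside the other settings already used in the question:

--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2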

(2) For file size, you can derive it from the average number of bytes per record. Below I am figuring out the number of bytes per record in order to work out the number of records that fit in 1024 MB. I would try it first with 1024 MB per partition, then move upwards.

import org.apache.spark.util.SizeEstimator

// Estimate the in-memory size of the dataframe, then work out roughly how many
// records correspond to 1024 MB so the output files land near that size.
val numberBytes: Long = SizeEstimator.estimate(inputDF.rdd)
val reduceBytesTo1024MB = numberBytes / 1073741824L          // 1024 MB = 1073741824 bytes
val numberRecords = inputDF.count
val recordsFor1024MB = (numberRecords / reduceBytesTo1024MB).toInt + 1
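One possible way to wire that estimate into the write (my extrapolation, not part of the original answer) is to feed it into the per-file record cap the question already uses:

// Hypothetical follow-up: use the derived record count as the per-file cap before writing.
spark.conf.set("spark.sql.files.maxRecordsPerFile", recordsFor1024MB.toLong)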

(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher, since you are outputting Parquet, you can set the Parquet-optimized writer to TRUE.

spark.sql.parquet.fs.optimized.committer.optimization-enabled true
