
Apache Spark to S3 upload performance issue

I'm seeing a major performance issue when Apache Spark uploads its results to S3. As I understand it, the process goes through these steps:

  1. The output of the final stage is written to a _temp/ table in HDFS, and the same data is moved into a "_temporary" folder inside the specific S3 folder.

  2. Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and the files inside the "_temporary" folder in S3 are moved into the main folder. This step is what actually takes a long time, approximately 1 minute per file (average size: 600 MB BZ2), and it does not get logged in the usual stderr log.

I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.

Has anyone encountered this issue?

Update 1

How can I increase the number of threads that perform this move process?

Any suggestions are highly appreciated.

Thanks

This was fixed with SPARK-3595 ( https://issues.apache.org/jira/browse/SPARK-3595 ), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark ).

I use the functions below to upload files to S3. They upload around 60 GB of gz files in 4-6 minutes.

        // Use a comma as the key/value separator in the output files
        ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
        counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class,
                TextOutputFormat.class);

Make sure that you create more output files; a larger number of smaller files will make the upload faster.
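One way to get more, smaller output files is to repartition the RDD before saving. The sketch below assumes the same `counts` pair RDD and `s3outputpath` as in the snippet above; the partition count of 100 is an illustrative assumption that should be tuned to your data size and cluster.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;

// Assumed: counts is a JavaPairRDD<Text, Text> and s3outputpath points at the
// S3 destination, as in the snippet above.
// Repartitioning into more partitions produces more (and smaller) part files,
// which can then be moved/uploaded to S3 in parallel.
counts.repartition(100)
      .saveAsHadoopFile(s3outputpath, Text.class, Text.class,
              TextOutputFormat.class);
```

The trade-off is a shuffle: `repartition` redistributes the data across the cluster, so it adds some cost to the job itself in exchange for a faster final upload step.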

API details:

        saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit

Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.
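Since the question mentions BZ2 output, the codec overload of saveAsHadoopFile can keep the files compressed. This is a sketch using the same assumed `counts` and `s3outputpath` as above:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapred.TextOutputFormat;

// Same call as before, but passing a compression codec so that each
// part file is written as BZip2-compressed text.
counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class,
        TextOutputFormat.class, BZip2Codec.class);
```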
