
Apache Spark to S3 upload performance issue

I'm seeing a major performance issue when Apache Spark uploads its results to S3. As I understand it, the process goes through these steps:

  1. The output of the final stage is written to a _temp/ table in HDFS, and the same data is moved into a "_temporary" folder inside the specific S3 folder.

  2. Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and the files inside the "_temporary" folder in S3 are moved into the main folder. This step is what actually takes a long time, approximately 1 minute per file (average size: 600 MB BZ2), and it does not get logged in the usual stderr log.

I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.

Has anyone encountered this issue?

Update 1

How can I increase the number of threads that perform this move process?

Any suggestions are highly appreciated.

Thanks

This was fixed with SPARK-3595 ( https://issues.apache.org/jira/browse/SPARK-3595 ), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark ).

I use the functions below to upload files to S3. They upload around 60 GB of gz files in 4-6 minutes.

        // Use a comma as the key/value separator in the output files
        ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
        counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class,
                TextOutputFormat.class);

Make sure that you create more output files; a larger number of smaller files will make the upload faster.
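One way to get more, smaller output files is to repartition the RDD before saving. The sketch below assumes the same `counts` pair RDD and `s3outputpath` as in the snippet above; the partition count of 100 is an illustrative assumption that should be tuned to your data size and cluster.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextOutputFormat;

// Assumed: counts is a JavaPairRDD<Text, Text> and s3outputpath points at the
// S3 destination, as in the snippet above.
// Repartitioning into more partitions produces more (and smaller) part files,
// which can then be moved/uploaded to S3 in parallel.
counts.repartition(100)
      .saveAsHadoopFile(s3outputpath, Text.class, Text.class,
              TextOutputFormat.class);
```

The trade-off is a shuffle: `repartition` redistributes the data across the cluster, so it adds some cost to the job itself in exchange for a faster final upload step.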

API details:

        saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit

Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.
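Since the question mentions BZ2 output, the codec overload of saveAsHadoopFile can keep the files compressed. This is a sketch using the same assumed `counts` and `s3outputpath` as above:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapred.TextOutputFormat;

// Same call as before, but passing a compression codec so that each
// part file is written as BZip2-compressed text.
counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class,
        TextOutputFormat.class, BZip2Codec.class);
```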
