I'm seeing a major performance issue when Apache Spark uploads its results to S3. As I understand it, the job goes through these steps:

The output of the final stage is written to a _temp/ table in HDFS, and the same files are moved into a "_temporary" folder inside the target S3 folder.

Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and the files inside the "_temporary" folder in S3 are moved into the main folder. This move is taking a long time, approximately 1 minute per file (average size: 600 MB, BZ2-compressed), and it is not logged in the usual stderr log.
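The slow step above is essentially the Hadoop output committer renaming each part file out of "_temporary" one at a time; on HDFS a rename is a cheap metadata operation, but on S3 a "rename" is a full copy plus delete, which is why this step dominates wall-clock time. A minimal sketch of that commit loop (hypothetical illustration only, with the local filesystem standing in for S3):

```java
import java.io.IOException;
import java.nio.file.*;

public class CommitSketch {
    // Hypothetical sketch of what the output committer does after all
    // tasks finish: each part file under _temporary/ is renamed into
    // the destination folder, sequentially. On HDFS this rename is a
    // metadata-only operation; on S3 it is a copy + delete per object,
    // which is roughly where the "1 minute per 600 MB file" goes.
    public static void commit(Path tempDir, Path destDir) throws IOException {
        Files.createDirectories(destDir);
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(tempDir)) {
            for (Path part : parts) {
                // one sequential move per part file
                Files.move(part, destDir.resolve(part.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```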
I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.
Has anyone encountered this issue?
Update 1
How can I increase the number of threads that perform this move?
Any suggestion is highly appreciated...
Thanks
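Since the committer moves the part files one at a time, one possible workaround is to run your own parallel move step after the job finishes. The helper below is a hypothetical sketch (the names are mine, and the local filesystem stands in for an S3 client), moving files with a fixed-size thread pool instead of a single sequential loop:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class ParallelMover {
    // Hypothetical helper: move every file in srcDir to destDir using a
    // fixed-size thread pool, instead of the committer's one-at-a-time loop.
    // With N threads, up to N copy+delete operations run concurrently.
    public static void moveAll(Path srcDir, Path destDir, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(destDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path f : files) {
                pool.submit(() -> {
                    try {
                        Files.move(f, destDir.resolve(f.getFileName()),
                                   StandardCopyOption.REPLACE_EXISTING);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}
```

This only helps if you can run a copy step of your own (for example, after writing the output to HDFS first); it does not change how Spark's built-in committer behaves.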
This was fixed in SPARK-3595 (https://issues.apache.org/jira/browse/SPARK-3595), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark).
I use the code below. It uploads the output to S3, moving around 60 GB of gz files in 4-6 minutes.

    // use "," instead of the default tab as the key/value separator
    ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
    counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class, TextOutputFormat.class);
Make sure that you create more output files; a larger number of smaller files will make the upload faster.
API details:

    saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit

Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.