
Apache Spark to S3 upload performance Issue

I'm seeing a major performance issue when Apache Spark uploads its results to S3. As I understand it, the upload goes through these steps:

  1. The output of the final stage is written to _temp/ in HDFS and then moved into the "_temporary" folder inside the target S3 folder.

  2. Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and the files inside the "_temporary" folder in S3 are then moved into the main folder. This move takes a long time: approximately 1 minute per file (average size: 600 MB, BZ2-compressed). This step is not logged in the usual stderr log.

I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.

Has anyone encountered this issue?

Update 1

How can I increase the number of threads that perform this move?

Any suggestion is highly appreciated...

Thanks

This was fixed by SPARK-3595 (https://issues.apache.org/jira/browse/SPARK-3595), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark).

I use the calls below to upload files to S3. They upload around 60 GB of gz files in 4-6 minutes.

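        // Assumed here: ctx is the JavaSparkContext and counts is a pair RDD of Text keys and values.
        // Separate key and value with a comma instead of the default tab.
        ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
        // Text is org.apache.hadoop.io.Text; TextOutputFormat is the org.apache.hadoop.mapred one.
        counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class,
                TextOutputFormat.class);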

Make sure that you create more output files: a larger number of smaller files makes the upload faster.
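
One way to act on this is to choose the partition count from a target file size before saving. The following is only a rough sketch: the class name PartitionSizing, the helper partitionsFor, and the 128 MB target are assumptions made for illustration and do not come from the answer above. The right target depends on your data and cluster.

```java
public final class PartitionSizing {
    private PartitionSizing() {}

    /**
     * Returns the smallest partition count such that each output file holds
     * at most targetBytesPerFile, assuming data is spread evenly (at least 1).
     */
    public static int partitionsFor(long totalBytes, long targetBytesPerFile) {
        if (totalBytes < 0 || targetBytesPerFile <= 0) {
            throw new IllegalArgumentException("totalBytes must be >= 0 and targetBytesPerFile > 0");
        }
        // Ceiling division written to avoid overflow on very large inputs.
        long n = totalBytes / targetBytesPerFile
                + (totalBytes % targetBytesPerFile == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long totalBytes = 60L * 1024 * 1024 * 1024; // roughly the 60 GB mentioned above
        long targetBytes = 128L * 1024 * 1024;      // assumed target: 128 MB per file
        int n = partitionsFor(totalBytes, targetBytes);
        System.out.println(n); // prints 480
        // In the Spark job, the count could then be applied before saving, e.g.:
        // counts.repartition(n).saveAsHadoopFile(s3outputpath, Text.class, Text.class,
        //         TextOutputFormat.class);
    }
}
```

Note that repartition triggers a shuffle, so it is worth checking that the faster upload outweighs that extra cost for your job.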

API details:

        saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit

Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.

