I'm seeing a major performance issue when Apache Spark uploads its results to S3. As I understand it, the job goes through these steps:

The output of the final stage is written to a _temp/ table in HDFS, and the same files are moved into a "_temporary" folder inside the target S3 folder.

Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and the files inside the "_temporary" folder in S3 are moved into the main folder. This move is taking a long time, approximately 1 minute per file (average size: 600 MB, BZ2-compressed), and it is not logged in the usual stderr log.
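The slow step above is essentially the Hadoop output committer renaming each part file out of "_temporary" one at a time; on HDFS a rename is a cheap metadata operation, but on S3 a "rename" is a full copy plus delete, which is why this step dominates wall-clock time. A minimal sketch of that commit loop (hypothetical illustration only, with the local filesystem standing in for S3):

```java
import java.io.IOException;
import java.nio.file.*;

public class CommitSketch {
    // Hypothetical sketch of what the output committer does after all
    // tasks finish: each part file under _temporary/ is renamed into
    // the destination folder, sequentially. On HDFS this rename is a
    // metadata-only operation; on S3 it is a copy + delete per object,
    // which is roughly where the "1 minute per 600 MB file" goes.
    public static void commit(Path tempDir, Path destDir) throws IOException {
        Files.createDirectories(destDir);
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(tempDir)) {
            for (Path part : parts) {
                // one sequential move per part file
                Files.move(part, destDir.resolve(part.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```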
I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.
Has anyone encountered this issue?
Update 1
How can I increase the number of threads that perform this move?
Any suggestion is highly appreciated...
Thanks
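Since the committer moves the part files one at a time, one possible workaround is to run your own parallel move step after the job finishes. The helper below is a hypothetical sketch (the names are mine, and the local filesystem stands in for an S3 client), moving files with a fixed-size thread pool instead of a single sequential loop:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.concurrent.*;

public class ParallelMover {
    // Hypothetical helper: move every file in srcDir to destDir using a
    // fixed-size thread pool, instead of the committer's one-at-a-time loop.
    // With N threads, up to N copy+delete operations run concurrently.
    public static void moveAll(Path srcDir, Path destDir, int threads)
            throws IOException, InterruptedException {
        Files.createDirectories(destDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path f : files) {
                pool.submit(() -> {
                    try {
                        Files.move(f, destDir.resolve(f.getFileName()),
                                   StandardCopyOption.REPLACE_EXISTING);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}
```

This only helps if you can run a copy step of your own (for example, after writing the output to HDFS first); it does not change how Spark's built-in committer behaves.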
This was fixed in SPARK-3595 (https://issues.apache.org/jira/browse/SPARK-3595), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark).
I use the code below. It uploads the output to S3, moving around 60 GB of gz files in 4-6 minutes.

    // use "," instead of the default tab as the key/value separator
    ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
    counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class, TextOutputFormat.class);
Make sure that you create more output files; a larger number of smaller files will make the upload faster.
API details:

    saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit

Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.