
spark saveAsTextFile is taking a lot of time - 1.6.3

I extract data from Mongo, process it, and then store the result in HDFS.

Extraction and processing of 1M records completes in less than 1.1 minutes.

Extraction Code

JavaRDD<Document> rdd = MongoSpark.load(jsc);

Processing Code

JavaRDD<String> fullFile = rdd.map(new Function<Document, String>() {
    public String call(Document s) {
        // Flatten the document's JSON and keep only the requested keys
        return JsonParsing.returnKeyJson(
                JsonParsing.returnFlattenMapJson(s.toJson()),
                args[3].split(","), extractionDetails);
    }
});
System.out.println("Records Downloaded - " + fullFile.count());

This completes in less than 1.1 minutes, as I fetch the count of the RDD at that point.
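(An aside, assuming the code above: fullFile is not cached, so count() and the later save each re-run the Mongo read and the map. If the processed records fit in executor memory, caching them once before the first action avoids that recomputation, though the write itself is usually the dominant cost.)

  // Hypothetical tweak: call this before the count() above so the later save
  // reuses the cached records instead of recomputing the lineage from Mongo.
  // Use persist(StorageLevel.MEMORY_AND_DISK()) instead if memory is tight.
  fullFile.cache();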

After that I have the following save command:

  fullFile
   .coalesce(1)
   .saveAsTextFile(args[4], GzipCodec.class);

This takes at least 15 to 20 minutes to save into HDFS.

I'm not sure why it takes so long. Let me know if anything can be done to speed up the process.

I am using the following options to run it: --num-executors 4 --executor-memory 4g --executor-cores 4

If I increase the number of executors or the memory, it still makes no difference. I have set the number of partitions to 70; I'm not sure whether increasing this would improve performance.

Any suggestion to reduce the save time would be helpful.

Thanks in advance.

fullFile
   .coalesce(1)
   .saveAsTextFile(args[4], GzipCodec.class);

Here you're using coalesce(1), which reduces the number of partitions to 1; that's why it is taking more time. Since there is only one partition at write time, only one task/executor writes the whole dataset to the target location. If you want the write to be faster, increase the value passed to coalesce, or simply remove the coalesce call. You can see the number of partitions used while writing data in the Spark UI.
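A minimal sketch of that suggestion (the value 16 below is only illustrative; pick something suited to your cluster and the number of output files you can live with). Writing with more partitions lets several tasks write part files in parallel; if you really need a single file, merging the parts afterwards (e.g. with hdfs dfs -getmerge) is usually cheaper than forcing one partition in Spark:

  fullFile
     .coalesce(16)                               // more partitions => more parallel writers
     .saveAsTextFile(args[4], GzipCodec.class);  // one gzipped part file per partition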
