
saveAsTextFile method in Spark

In my project, I have three input files and pass the file names as args(0) to args(2); I also have an output file name as args(3). In the source code, I use:

val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))

I do nothing to the log except save it as a text file, using:

log.coalesce(1, true).saveAsTextFile(args(args.size - 1))

but it still saves to 3 files, as part-00000, part-00001, part-00002. So is there any way I can save the three input files to a single output file?

Having multiple output files is standard behavior of multi-machine clusters like Hadoop or Spark. The number of output files depends on the number of reducers.
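For example (a minimal sketch using the sc from the question; "demo_out" is a placeholder path), an RDD with three partitions is written out as three part files:

// each partition becomes its own part file
sc.parallelize(1 to 100, 3).saveAsTextFile("demo_out")
// produces demo_out/part-00000, part-00001, part-00002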

How to "solve" it in Hadoop: merge output files after reduce phase 如何在Hadoop中“解决”它: 在reduce阶段之后合并输出文件

How to "solve" in Spark: how to make saveAsTextFile NOT split output into multiple file? 如何在Spark中“解决”: 如何使saveAsTextFile不将输出分割成多个文件?

A good info you can get also here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html 您也可以在这里获得一个很好的信息: http//apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
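For the Hadoop-side merge linked above, a minimal sketch (assuming Hadoop 2.x, since FileUtil.copyMerge was removed in Hadoop 3; "out" and "merged.txt" are placeholder paths) looks like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// merge every part-* file under "out" into a single "merged.txt"
FileUtil.copyMerge(fs, new Path("out"), fs, new Path("merged.txt"), false, conf, null)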

So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as @climbage mentioned in his comment), your code works if you run it locally.
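For reference, Spark's repartition(n) is implemented as coalesce(n, shuffle = true), so the two lines below do the same thing:

// equivalent ways to shuffle everything into a single partition:
log.coalesce(1, shuffle = true).saveAsTextFile(args(args.size - 1))
// log.repartition(1).saveAsTextFile(args(args.size - 1))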

What you might try is to read the files first and then save the output:

...
val sc = new SparkContext()
val sb = new StringBuilder
for (i <- 0 until args.size - 1) {
  // collect() pulls each file's lines back to the driver -- small inputs only!
  sc.textFile(args(i)).collect().foreach(line => sb.append(line).append("\n"))
}
// and now you might save the content as a single part file
sc.parallelize(Seq(sb.toString), 1).saveAsTextFile("out")

Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files, but rather process the multiple output files instead.
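One sketch of such "better code" that stays on the RDD API (assuming, as in the question, that the last argument is the output path): textFile accepts a comma-separated list of paths, so the union loop is not needed:

val sc = new SparkContext()
// read all input paths at once; textFile accepts a comma-separated list
val log = sc.textFile(args.init.mkString(","))
// still funnels all data through a single task, but stays distributed until the write
log.coalesce(1, shuffle = true).saveAsTextFile(args.last)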

As mentioned, your problem is somewhat unavoidable via the standard APIs, as the assumption is that you are dealing with large quantities of data. However, if I assume your data is manageable, you could try the following:

import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets

// `data` is the combined RDD of lines (e.g. the `log` RDD from the question)
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))

What I am doing here is converting the RDD into a String by performing a collect and then a mkString. I would suggest not doing this in production. It works fine for local data analysis (working with ~5 GB of local data).
