
saveAsTextFile method in Spark

In my project, I have three input files and pass the file names as args(0) to args(2); I also have an output file name as args(3). In the source code, I use:

val sc = new SparkContext()
var log = sc.textFile(args(0))
for(i <- 1 until args.size - 1) log = log.union(sc.textFile(args(i)))

I do nothing to the log except save it as a text file, using:

log.coalesce(1, true).saveAsTextFile(args(args.size - 1))

but it still saves to 3 files, as part-00000, part-00001, part-00002. So is there any way I can save the three input files to a single output file?

Having multiple output files is standard behavior of multi-machine clusters like Hadoop or Spark. The number of output files depends on the number of reducers.
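For example (a minimal sketch using the sc from the question; "demo_out" is a placeholder path), an RDD with three partitions is written out as three part files:

// each partition becomes its own part file
sc.parallelize(1 to 100, 3).saveAsTextFile("demo_out")
// produces demo_out/part-00000, part-00001, part-00002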

How to "solve" it in Hadoop: merge output files after reduce phase 如何在Hadoop中“解决”它: 在reduce阶段之后合并输出文件

How to "solve" in Spark: how to make saveAsTextFile NOT split output into multiple file? 如何在Spark中“解决”: 如何使saveAsTextFile不将输出分割成多个文件?

A good info you can get also here: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html 您也可以在这里获得一个很好的信息: http//apache-spark-user-list.1001560.n3.nabble.com/How-to-make-Spark-merge-the-output-file-td322.html
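For the Hadoop-side merge linked above, a minimal sketch (assuming Hadoop 2.x, since FileUtil.copyMerge was removed in Hadoop 3; "out" and "merged.txt" are placeholder paths) looks like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// merge every part-* file under "out" into a single "merged.txt"
FileUtil.copyMerge(fs, new Path("out"), fs, new Path("merged.txt"), false, conf, null)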

So, you were right about coalesce(1, true). However, it is very inefficient. Interestingly (as @climbage mentioned in his comment), your code works if you run it locally.
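For reference, Spark's repartition(n) is implemented as coalesce(n, shuffle = true), so the two lines below do the same thing:

// equivalent ways to shuffle everything into a single partition:
log.coalesce(1, shuffle = true).saveAsTextFile(args(args.size - 1))
// log.repartition(1).saveAsTextFile(args(args.size - 1))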

What you might try is to read the files first and then save the output:

...
val sc = new SparkContext()
val sb = new StringBuilder
for (i <- 0 until args.size - 1) {
  // collect() pulls each file's lines back to the driver -- small inputs only!
  sc.textFile(args(i)).collect().foreach(line => sb.append(line).append("\n"))
}
// and now you might save the content as a single part file
sc.parallelize(Seq(sb.toString), 1).saveAsTextFile("out")

Note: this code is also extremely inefficient and works for small files only! You need to come up with better code. I wouldn't try to reduce the number of files, but rather process the multiple output files instead.
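One sketch of such "better code" that stays on the RDD API (assuming, as in the question, that the last argument is the output path): textFile accepts a comma-separated list of paths, so the union loop is not needed:

val sc = new SparkContext()
// read all input paths at once; textFile accepts a comma-separated list
val log = sc.textFile(args.init.mkString(","))
// still funnels all data through a single task, but stays distributed until the write
log.coalesce(1, shuffle = true).saveAsTextFile(args.last)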

As mentioned, your problem is somewhat unavoidable via the standard APIs, as the assumption is that you are dealing with large quantities of data. However, if I assume your data is manageable, you could try the following:

import java.nio.file.{Paths, Files}
import java.nio.charset.StandardCharsets

// `data` is the combined RDD of lines (e.g. the `log` RDD from the question)
Files.write(Paths.get("./test_file"), data.collect.mkString("\n").getBytes(StandardCharsets.UTF_8))

What I am doing here is converting the RDD into a String by performing a collect and then a mkString. I would suggest not doing this in production. It works fine for local data analysis (working with ~5 GB of local data).
