

Merging output of Spark into one file

I understand that my question is similar to Merge Output files after reduce phase, however I think it may be different because I am using Spark only on a local machine and not actually on a distributed file system.

I have Spark installed on a single VM (for testing). The output is given in several files (part-000000, part-000001, etc.) in a folder called 'STjoin' in Home/Spark_Hadoop/spark-1.1.1-bin-cdh4/.

The command hadoop fs -getmerge /Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin /desired/local/output/file.txt does not seem to work ("No such file or directory").

Is this because this command only applies to files stored in HDFS and not locally, or am I not understanding Linux paths in general? (I am new to both Linux and HDFS.)

Simply do cat /path/to/source/dir/* > /path/to/output/file.txt. getmerge is the Hadoop equivalent, and it only applies to files stored in HDFS.
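A minimal sketch of both approaches, assuming the part files live in the local 'STjoin' directory mentioned in the question (adjust the paths and home directory to your setup):

# Local filesystem: concatenate all part files produced by the local Spark run
# into a single text file.
cat ~/Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin/part-* > ~/STjoin_merged.txt

# HDFS: getmerge expects an HDFS source path and a *local* destination path,
# so only use this form if the job actually wrote its output to HDFS.
hadoop fs -getmerge /user/hadoop/STjoin /desired/local/output/file.txt

The "No such file or directory" error in the question is consistent with this: the path passed to getmerge is interpreted as an HDFS path, and no such directory exists in HDFS when the output was written to the local filesystem.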
