

Merging output of Spark into one file

I understand that my question is similar to Merge Output files after reduce phase, however I think it may be different because I am using Spark only on a local machine and not actually on a distributed file system.

I have Spark installed on a single VM (for testing). The output is given in several files (part-000000, part-000001, etc.) in a folder called 'STjoin' in Home/Spark_Hadoop/spark-1.1.1-bin-cdh4/.

The command hadoop fs -getmerge /Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin /desired/local/output/file.txt does not seem to work ("No such file or directory").

Is this because this command only applies to files stored in HDFS and not locally, or am I not understanding Linux paths in general? (I am new to both Linux and HDFS.)

Simply do cat /path/to/source/dir/* > /path/to/output/file.txt. getmerge is the Hadoop equivalent, and it only applies to files stored in HDFS.
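A minimal sketch of both approaches, assuming the part files live in the local 'STjoin' directory mentioned in the question (adjust the paths and home directory to your setup):

# Local filesystem: concatenate all part files produced by the local Spark run
# into a single text file.
cat ~/Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin/part-* > ~/STjoin_merged.txt

# HDFS: getmerge expects an HDFS source path and a *local* destination path,
# so only use this form if the job actually wrote its output to HDFS.
hadoop fs -getmerge /user/hadoop/STjoin /desired/local/output/file.txt

The "No such file or directory" error in the question is consistent with this: the path passed to getmerge is interpreted as an HDFS path, and no such directory exists in HDFS when the output was written to the local filesystem.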
