Merging output of Spark into one file
I understand that my question is similar to Merge Output files after reduce phase, but I think it may be different because I am using Spark on a local machine only, not on an actual distributed file system.
I have Spark installed on a single VM (for testing). The output is written as several files (part-00000, part-00001, etc.) in a folder called 'STjoin' under Home/Spark_Hadoop/spark-1.1.1-bin-cdh4/.
The command

hadoop fs -getmerge /Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin /desired/local/output/file.txt

does not seem to work ("No such file or directory").
Is this because this command only applies to files stored in HDFS and not to local files, or am I misunderstanding Linux paths in general? (I am new to both Linux and HDFS.)
Simply do

cat /path/to/source/dir/* > /path/to/output/file.txt

getmerge is the Hadoop equivalent for files stored in HDFS.
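As a minimal sketch of the cat approach (the /tmp paths and file contents here are illustrative, standing in for the 'STjoin' part files on the local filesystem):

```shell
#!/bin/sh
# Create a local directory mimicking Spark's part-file output.
mkdir -p /tmp/STjoin
printf 'line1\n' > /tmp/STjoin/part-00000
printf 'line2\n' > /tmp/STjoin/part-00001

# Concatenate every part file into a single output file.
# This works only because the output lives on the local
# filesystem, not in HDFS.
cat /tmp/STjoin/part-* > /tmp/file.txt

cat /tmp/file.txt
```

Note that the shell expands `part-*` in lexicographic order, which matches Spark's zero-padded part numbering, so the merged file preserves the order of the partitions.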