
Hadoop Combine Multiple part files into single file

Currently I have

part-00001 part-00002

I know that using hdfs -getmerge is the best way to combine those files into a single one. However, is it possible to do it programmatically?

I've tried using MultipleOutput, but it is not working. I've also tried writing my own CustomOutputFormat; however, because multiple reducers write to the file in parallel, it gives an org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException when closing the DataOutputStream.

You can always use the FileSystem class from your Java code, and probably calling the concat method is all you need.
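For example, here is a minimal sketch of calling concat directly. The paths are placeholders; also note that concat is only implemented by HDFS's DistributedFileSystem and has restrictions (all files in the same directory, same block size/replication, and on older HDFS versions the source files had to end on a block boundary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatParts {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder paths: the target file absorbs the blocks of the sources.
        Path target = new Path("/user/me/output/part-00001");
        Path[] sources = { new Path("/user/me/output/part-00002") };

        // Merges the source files into the target at the block level
        // (no data is copied); throws if the filesystem does not support it.
        fs.concat(target, sources);

        fs.close();
    }
}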

MultipleOutput does almost the opposite. Instead of just the part-xxxxx files, it also produces custom-named files, which typically means more files than before.

CustomOutputFormat is also not a good idea, since in any case you will have as many output files as the number of reducers. The output format will not change that.

Using a single reducer (setNumReduceTasks(1)) could be a working solution, but it is unnecessarily expensive, since it "kills" parallelism (all the data is processed by a single task). Consider it only if your data is rather small; otherwise avoid it.
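If your data is small enough for that, the change is a one-liner in the job driver; a sketch with the rest of the driver omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleReducerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-reducer-job");

        // One reducer => a single part-r-00000 output file,
        // at the cost of running all reduce work in one task.
        job.setNumReduceTasks(1);

        // ... configure mapper, reducer, input/output paths as usual,
        // then submit with job.waitForCompletion(true).
    }
}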

Another solution would be to simply call hdfs -getmerge as a shell command from your Java code, after the MapReduce job is complete.
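A rough sketch of that last option, shelling out with ProcessBuilder. The paths are placeholders, the command is spelled out as hdfs dfs -getmerge (assumed to be on the PATH), and keep in mind that getmerge writes the merged copy to the local filesystem, not back into HDFS:

import java.io.IOException;

public class GetMergeRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder paths: HDFS output directory -> local merged file.
        ProcessBuilder pb = new ProcessBuilder(
                "hdfs", "dfs", "-getmerge", "/user/me/job-output", "/tmp/merged.txt");
        pb.inheritIO();                       // show the command's stdout/stderr
        int exitCode = pb.start().waitFor();  // wait for the merge to finish
        if (exitCode != 0) {
            throw new IOException("getmerge failed with exit code " + exitCode);
        }
    }
}

Older Hadoop 2.x releases also shipped FileUtil.copyMerge for doing the same thing through the Java API, but as far as I know it was removed in Hadoop 3, so the shell command is the more portable route.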

You cannot do it programmatically, as this is managed by Hadoop: the number of files created depends on the number of reducers configured. Why do you need to merge these files programmatically? If they serve as input to another job, you can always pass the directory as the input path, and use CombineFileInputFormat if there are a lot of small part-xxxxx files. Otherwise, hdfs -getmerge is the best option if you want to merge them yourself.
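For instance, a sketch of feeding the output directory of one job into the next using CombineTextInputFormat (the text-oriented subclass of CombineFileInputFormat); the input path and the 128 MB split cap are just illustrative values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "consume-small-part-files");

        // Pack many small part-xxxxx files into fewer, larger input splits.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly 128 MB (illustrative value).
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // Placeholder path: the output directory of the previous job.
        FileInputFormat.addInputPath(job, new Path("/user/me/previous-job-output"));

        // ... set mapper/reducer classes and the output path as usual,
        // then submit with job.waitForCompletion(true).
    }
}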
