Hadoop: providing a directory as input to a MapReduce job

Question

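I am using Cloudera Hadoop. I can run a simple MapReduce program in which I provide a single file as the input to the job.

That file lists all the other files that the mapper function should process.

But I am stuck at one point.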

/folder1
  - file1.txt
  - file2.txt
  - file3.txt

How can I specify the input path of the MapReduce program as "/folder1", so that it processes every file inside that directory?

Any ideas?

EDIT:

1) Initially, I provided inputFile.txt as the input to the MapReduce program. It worked perfectly.

>inputFile.txt
file1.txt
file2.txt
file3.txt

2) But now, instead of giving an input file, I want to provide an input directory as args[0] on the command line.

hadoop jar ABC.jar /folder1 /output

Answer 1

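The problem is that FileInputFormat does not read files recursively from the input path directory.

Solution: use the following code.

Add FileInputFormat.setInputDirRecursive(job, true); before the line below in your MapReduce code: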

FileInputFormat.addInputPath(job, new Path(args[0]));
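For context, here is a minimal sketch of how those two calls might sit inside a complete driver. It assumes the new (org.apache.hadoop.mapreduce) API and a Hadoop release whose FileInputFormat provides setInputDirRecursive. The class name DirectoryInputDriver and the map-only, identity-mapper setup are illustrative choices, not part of the original answer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; the name and the map-only setup are illustrative.
public class DirectoryInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "directory input sketch");
        job.setJarByClass(DirectoryInputDriver.class);

        // No mapper class is set, so Hadoop's default identity Mapper runs.
        // With the default TextInputFormat it emits (byte offset, line) pairs.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Ask FileInputFormat to descend into subdirectories of the input path.
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /folder1
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into ABC.jar with this class as the main class, it would be launched with the same command shown in the question: hadoop jar ABC.jar /folder1 /output.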

You can check here which version it was fixed in.

Answer 2

You can use FileSystem.listStatus to get the list of files in a given directory. The code looks like this:

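import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;

// conf is assumed to be a JobConf (old mapred API), which is also a Configuration
FileSystem fs = FileSystem.get(conf);
// list the entries directly under the given directory
FileStatus[] statusList = fs.listStatus(new Path(args[0]));
if (statusList != null) {
    for (FileStatus status : statusList) {
        // add each entry to the list of inputs for the map-reduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}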

Answer 3

You can use HDFS wildcards to supply multiple files.

So, the solution:

hadoop jar ABC.jar /folder1/* /output

or

hadoop jar ABC.jar /folder1/*.txt /output
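To get a feel for which names the two patterns above select, the following sketch uses the JDK's own glob matcher on a Unix-like system as a local stand-in. This is only an analogy: java.nio globs resemble Hadoop's path globs for simple cases such as `*`, but they are a different implementation, and the class GlobSketch and its sample paths are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper: local glob matching as a stand-in for HDFS path globs.
class GlobSketch {
    // Returns true if the path matches the glob pattern (java.nio glob syntax).
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Sample paths are made up for illustration.
        String[] paths = {
            "/folder1/file1.txt",
            "/folder1/notes.log",
            "/folder1/sub/file4.txt"
        };
        for (String p : paths) {
            System.out.println(p
                    + "  /folder1/* -> " + matches("/folder1/*", p)
                    + "  /folder1/*.txt -> " + matches("/folder1/*.txt", p));
        }
    }
}
```

In this sketch `*` does not cross a directory separator, so a file inside a nested subdirectory is not selected by either pattern.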

Answer 4

Use the MultipleInputs class.

MultipleInputs.addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)

Take a look at the working code.
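As a rough illustration of how that call might be used (this is not the code linked above), the sketch below registers the input directory with TextInputFormat and a pass-through mapper. It assumes the new-API MultipleInputs in org.apache.hadoop.mapreduce.lib.input. The names MultipleInputsDriver and LineMapper are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver; class and mapper names are illustrative.
public class MultipleInputsDriver {

    // Hypothetical mapper: writes each input line through unchanged.
    public static class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(offset, line);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple inputs sketch");
        job.setJarByClass(MultipleInputsDriver.class);
        job.setNumReduceTasks(0); // map-only, to keep the sketch short
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Bind the input path to its own input format and mapper class.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, LineMapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

MultipleInputs is mainly useful when different paths need different input formats or mappers; each extra path would get its own addInputPath call.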

4 answers

Answer 1: 13, 2014-05-28 09:33:50
Answer 2: 2, 2013-11-20 13:14:12
Answer 3: 1, 2015-11-07 11:02:32
Answer 4: 0, 2016-01-07 15:27:20
