Java Hadoop: How can I create mappers that take files as input and output the number of lines in each file?

Question

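I'm new to Hadoop and so far I have only managed to run the wordCount example: http://hadoop.apache.org/common/docs/r0.18.2/mapred_tutorial.html

Suppose we have a folder containing 3 files. I want one mapper per file, and this mapper should just count the number of lines and return it to the reducer.

The reducer would then take the line count from each mapper as input and output the total number of lines across all 3 files.

So if we have the following 3 files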

input1.txt
input2.txt
input3.txt

and the mappers return:

mapper1 -> [input1.txt, 3]
mapper2 -> [input2.txt, 4]
mapper3 -> [input3.txt, 9]

the reducer will output

3+4+9 = 16

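I have already done this in a simple Java application, so now I would like to do it in Hadoop. I have only one computer and would like to try running in a pseudo-distributed environment.

How can I achieve this? What are the proper steps to take?

Should my code look like the Apache example? Would I have two static classes, one for the mapper and one for the reducer? Or should I have 3 classes, one for each mapper?

Please guide me through this if you can. I have no idea how to do it, and I believe that if I manage to write some code that does this, I will be able to write more complex applications in the future.

Thanks!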

Answer 1

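In addition to sa125's answer, you can improve performance considerably by not emitting a record for every input record. Instead, accumulate a counter in the mapper, then emit the filename and the count value once, in the mapper's cleanup method: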

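import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Lines seen by this mapper instance (one instance per input split)
    protected long lines = 0;

    // Runs once, after the last map() call: emit a single (file, count) record
    @Override
    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        String filename = split.getPath().toString();

        context.write(new Text(filename), new LongWritable(lines));
    }

    // Runs once per input line: only count, emit nothing
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        lines++;
    }
}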
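
One thing to keep in mind with the mapper above: a file that is read as several splits produces several (filename, count) records, so the reduce side still has to add them up per filename. Here is a minimal plain-Java model of that summing step, with no Hadoop dependency. The class name `PartialCountSketch`, the `Partial` record type and the sample counts are made up for illustration; in a real job this logic would sit in a `Reducer<Text, LongWritable, Text, LongWritable>`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model (hypothetical names, no Hadoop) of a summing reducer
// fed with the per-split records emitted by a counting mapper.
class PartialCountSketch {

    // One record as a mapper would emit it: (file name, lines seen in that split).
    static final class Partial {
        final String fileName;
        final long lines;

        Partial(String fileName, long lines) {
            this.fileName = fileName;
            this.lines = lines;
        }
    }

    // Groups the records by file name and adds up their counts,
    // which is what a reducer keyed on the file name would do.
    static Map<String, Long> sumPerFile(List<Partial> partials) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Partial p : partials) {
            totals.merge(p.fileName, p.lines, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Partial> partials = new ArrayList<>();
        partials.add(new Partial("input1.txt", 3));
        partials.add(new Partial("input2.txt", 4));
        // Assumption for the example: input3.txt was read as two splits.
        partials.add(new Partial("input3.txt", 5));
        partials.add(new Partial("input3.txt", 4));

        // prints {input1.txt=3, input2.txt=4, input3.txt=9}
        System.out.println(sumPerFile(partials));
    }
}
```

With small files like the three in the question there is normally one split per file, so each filename arrives exactly once and the sum is trivial.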

Answer 2

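I noticed you are using the docs for version 0.18. There is a newer version of that tutorial for 1.0.2 (the latest).

First suggestion - use an IDE (Eclipse, IDEA, etc.). It really helps to fill in the blanks.

In actual HDFS you cannot know where each piece of a file resides (different machines and clusters). There is no guarantee that row X even resides on the same disk as row Y. There is also no guarantee that row X will not be split across different machines (HDFS distributes data in blocks, typically 64 MB each). This means you cannot assume that the same mapper will process an entire file. What you can make sure of is that each file is processed by the same reducer.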
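
To get a feel for why one mapper per file is not guaranteed, here is a rough arithmetic sketch in plain Java. It assumes one input split (and so one mapper) per 64 MB block, which is a simplification and not Hadoop's actual split computation; the class and method names are made up for illustration.

```java
// Rough model only: assumes one split per block and ignores how Hadoop
// really computes splits. All names here are hypothetical.
class SplitEstimateSketch {

    // Block size mentioned above: 64 MB.
    static final long BLOCK_SIZE_BYTES = 64L * 1024 * 1024;

    // Ceiling division: how many blocks are needed to hold the file.
    static long estimateSplits(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long smallFile = 1024L;               // 1 KB, like the files in the question
        long largeFile = 200L * 1024 * 1024;  // 200 MB

        // prints 1
        System.out.println(estimateSplits(smallFile, BLOCK_SIZE_BYTES));
        // prints 4
        System.out.println(estimateSplits(largeFile, BLOCK_SIZE_BYTES));
    }
}
```

Under that assumption the three tiny files from the question would each go to a single mapper, while a 200 MB file would be spread over four.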

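Since all values for a given key sent by the mappers end up in the same reducer, the way I would do this is to use the filename as the output key in the mapper. Also, the default input class for the mapper is TextInputFormat, which means each call to map receives one whole line (terminated by LF or CR). You can then emit the filename and the number 1 (or anything else, it is irrelevant to the calculation) from the mapper. Then, in the reducer, you simply use a loop to count how many times the filename was received:

In the mapper's map function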

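// needs: java.io.IOException, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Mapper, org.apache.hadoop.mapreduce.lib.input.FileSplit
public static class Map extends Mapper<LongWritable, Text, Text, Text> {

  // with TextInputFormat the input key is the line's byte offset (a LongWritable)
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // get the filename (InputSplit itself has no getPath(), hence the cast)
    FileSplit split = (FileSplit) context.getInputSplit();
    String fileName = split.getPath().getName();

    // send the filename to the reducer, the value
    // has no meaning (I just put "1" to have something)
    context.write( new Text(fileName), new Text("1") );
  }

}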

In the reducer's reduce function

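// needs: java.io.IOException, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer
public static class Reduce extends Reducer<Text, Text, Text, Text> {

  // the new API passes an Iterable (not an Iterator); with the wrong type
  // this method would not override reduce() and would never be called
  @Override
  public void reduce(Text fileName, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long rowCount = 0;

    // values holds one entry for each row, so the actual value doesn't matter
    for (Text val : values) {
      rowCount += 1;
    }

    // fileName is the Text key received (no need to create a new object)
    context.write( fileName, new Text( String.valueOf( rowCount ) ) );
  }

}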

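In the driver/main

You can pretty much use the same driver as in the wordcount example - note that I used the new mapreduce API, so you will need to adjust a few things (Job instead of JobConf, etc.). Reading up on it was really helpful for me.

Note that your MR output will be just each filename and its line count: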

input1.txt    3
input2.txt    4
input3.txt    9

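If you only want the TOTAL number of lines across all files, simply emit the same key from all mappers (instead of the filename). That way every record goes to a single reducer, which handles all the line counting: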

// no need for filename
context.write( new Text("blah"), new Text("1") );
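
To see how the choice of key changes the result, here is a small plain-Java model of the "emit (key, 1) per line, then count per key" flow, without any Hadoop classes. `KeyChoiceSketch`, `countLines` and the `"total"` key are names made up for this sketch, and the file contents are placeholders that only match the line counts from the question.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model (hypothetical names) of mapping each line to (key, 1)
// and counting per key on the reduce side.
class KeyChoiceSketch {

    // files: file name -> its lines. keyForFile decides which key each
    // line of a file is emitted under.
    static Map<String, Long> countLines(Map<String, List<String>> files,
                                        Function<String, String> keyForFile) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            String key = keyForFile.apply(file.getKey());
            for (String ignoredLine : file.getValue()) {
                // one (key, 1) record per line, summed per key
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("input1.txt", Collections.nCopies(3, "some line"));
        files.put("input2.txt", Collections.nCopies(4, "some line"));
        files.put("input3.txt", Collections.nCopies(9, "some line"));

        // Key = file name: prints {input1.txt=3, input2.txt=4, input3.txt=9}
        System.out.println(countLines(files, name -> name));

        // Same key for every line: prints {total=16}
        System.out.println(countLines(files, name -> "total"));
    }
}
```

The counting logic is identical in both calls; only the key function differs, which is the same one-line change as in the mapper.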

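You could also chain a second job that processes the per-file line count output, or do other fancy stuff - it's up to you.

I left out some boilerplate code, but the basics are there. Be sure to check me, since I typed most of this from memory.. :)

Hope this helps!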
