MapReduce: filter out key-value pairs if the value does not exceed a threshold

Question

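Using MapReduce, how can I modify the following word count code so that it only outputs words whose count is above a given threshold? (For example, I would like to add some kind of key-value pair filtering.)

Input: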

ant bee cat
bee cat dog
cat dog

Output: assuming a count threshold of 2 or greater

bee 2
cat 3
dog 2

The code below is from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code

public static class Map1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

public static class Reduce1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

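Edit: RE: input/test case

The input file ("example.dat") and a simple test case ("testcase") can be found here: https://github.com/csiu/tokens/tree/master/other/SO-26695749

Edit:

The problem was not in the code. It was caused by some odd behavior between the org.apache.hadoop.mapred packages. (See "Is it better to use the mapred or the mapreduce package to create a Hadoop Job?")

Takeaway: use `org.apache.hadoop.mapreduce` instead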

Answer 1

Try adding an if statement before collecting the output in reduce.

if(sum >= 2)
    output.collect(key, new IntWritable(sum));

Answer 2

You can do the filtering in the Reduce1 class:

if (sum>=2) {
    output.collect(key, new IntWritable(sum));
}
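To see the effect of the threshold check without a Hadoop cluster, here is a minimal plain-Java sketch that mimics the map, shuffle, and reduce steps. It is not Hadoop code: the class name `ThresholdWordCount`, the `run` helper, and the `BiConsumer` standing in for `OutputCollector` are all assumptions made for illustration. Only the `if (sum >= threshold)` guard corresponds to the change suggested in the answers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical stand-alone simulation; no Hadoop classes are used.
public class ThresholdWordCount {

    // Map step: emit (word, 1) for every whitespace-separated token in the line.
    static void map(String line, BiConsumer<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            output.accept(tokenizer.nextToken(), 1);
        }
    }

    // Reduce step: sum the counts for one key, then emit only if the
    // sum reaches the threshold (the filter suggested in the answers).
    static void reduce(String key, Iterator<Integer> values, int threshold,
                       BiConsumer<String, Integer> output) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        if (sum >= threshold) {
            output.accept(key, sum);
        }
    }

    // Stand-in for the framework: run map on each line, group the emitted
    // values by key (the "shuffle"), then call reduce once per key.
    static Map<String, Integer> run(List<String> lines, int threshold) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            map(line, (word, one) ->
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(one));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            reduce(entry.getKey(), entry.getValue().iterator(), threshold, result::put);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = List.of("ant bee cat", "bee cat dog", "cat dog");
        run(input, 2).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

With the sample input and a threshold of 2, this prints `bee 2`, `cat 3`, and `dog 2`; `bee` occurs twice in the input, so it also passes a "2 or greater" filter, while `ant` (count 1) is dropped.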

2 solutions

Solution 1
1 Accepted 2014-11-02 03:34:23

Solution 2
1 2014-11-02 03:34:55