
MapReduce: Filter out key-value pairs if value is not above threshold

Using MapReduce, how do you modify the following word-count code so that it only outputs words at or above a certain count threshold? (e.g. I want to add some kind of filtering of key-value pairs.)

Input:

ant bee cat
bee cat dog
cat dog

Output: say the count threshold is 2 or more

bee 2
cat 3
dog 2

The following code is from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code

public static class Map1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

public static class Reduce1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

EDIT: RE: about inputs/test case

The input file ("example.dat") and a simple test case ("testcase") can be found here: https://github.com/csiu/tokens/tree/master/other/SO-26695749


EDIT:

The problem wasn't the code. It was due to some strange behavior of the org.apache.hadoop.mapred package. (Is it better to use the mapred or the mapreduce package to create a Hadoop job?)

Point: use org.apache.hadoop.mapreduce instead.

Try adding an if statement before collecting the output in reduce:

if(sum >= 2)
    output.collect(key, new IntWritable(sum));

You can just do the filtering in the Reduce1 class:

if (sum >= 2) {
    output.collect(key, new IntWritable(sum));
}
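The count-then-filter logic can be sketched outside of Hadoop in plain Java; this is only an illustration of what the map (tokenize, emit 1) and reduce (sum, then filter by threshold) steps compute, and the class and method names here are illustrative, not Hadoop APIs:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class ThresholdFilter {
    // Count word occurrences across all lines, then keep only words whose
    // total count meets the threshold -- the same effect as the
    // "if (sum >= threshold) collect" check inside the reducer.
    static Map<String, Integer> countAboveThreshold(String[] lines, int threshold) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // Map step: emit (word, 1); here the 1s are merged immediately.
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        // Reduce-side filter: drop any word whose summed count is below the threshold.
        counts.values().removeIf(sum -> sum < threshold);
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {"ant bee cat", "bee cat dog", "cat dog"};
        Map<String, Integer> result = countAboveThreshold(input, 2);
        result.forEach((word, count) -> System.out.println(word + " " + count));
        // bee 2
        // cat 3
        // dog 2
    }
}
```

With the example input, ant occurs once and is filtered out; bee and dog (2 each) and cat (3) meet the threshold of 2.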
