MapReduce: Filter out key-value pairs if value is not above threshold
Using MapReduce, how do you modify the following word count code so that it only outputs words whose count meets a certain threshold? (That is, I want to add some kind of filtering of key-value pairs.)
Input:
ant bee cat
bee cat dog
cat dog
Output (let's say the count threshold is 2 or more):
bee 2
cat 3
dog 2
The following code is from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code
public static class Map1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

public static class Reduce1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
EDIT (re: inputs/test case):
The input file ("example.dat") and a simple test case ("testcase") can be found here: https://github.com/csiu/tokens/tree/master/other/SO-26695749
EDIT:
The problem wasn't the code. It was due to some strange behavior with the org.apache.hadoop.mapred package. (See: Is it better to use the mapred or the mapreduce package to create a Hadoop Job?) Key point: use org.apache.hadoop.mapreduce instead.
Try adding an if statement before collecting the output in reduce:
if (sum >= 2)
    output.collect(key, new IntWritable(sum));
You can just do the filtering in the Reduce1 class:
if (sum >= 2) {
    output.collect(key, new IntWritable(sum));
}
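To check the filter logic without a cluster, here is a minimal sketch in plain Java that mimics what the reduce step does: sum the counts per word, then emit only the sums that meet the threshold. It does not use Hadoop at all. The class name ThresholdWordCount and the helper countAbove are hypothetical names made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class ThresholdWordCount {
    // Hypothetical helper (not part of Hadoop): counts every word across all
    // lines, then drops entries whose total is below the threshold. This is
    // the same "sum >= threshold" test suggested for reduce().
    static Map<String, Integer> countAbove(List<String> lines, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Tokenize on whitespace, as the mapper in the question does.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        // Filter only after the totals are complete.
        counts.values().removeIf(sum -> sum < threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ant bee cat", "bee cat dog", "cat dog");
        countAbove(lines, 2).forEach((word, sum) -> System.out.println(word + " " + sum));
    }
}
```

For the sample input and a threshold of 2, this prints bee 2, cat 3 and dog 2 (bee also occurs twice in the sample). One caveat when carrying this over to a real job: the filter should only see complete totals. If the filtering reducer class is also registered as the combiner, the test would run on partial sums and could drop words whose overall count meets the threshold.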