MapReduce: Filter out key-value pairs if value is not above threshold
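Using MapReduce, how can I modify the word count code below so that it only outputs words whose count is at or above a given threshold? (In other words, I want to add some kind of filtering of key-value pairs.)
Input: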
ant bee cat
bee cat dog
cat dog
Output, assuming a count threshold of 2 or more:
bee 2
cat 3
dog 2
The code below is from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code
public static class Map1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

public static class Reduce1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Edit: RE: the input/test case
The input file ("example.dat") and a simple test case ("testcase") can be found here: https://github.com/csiu/tokens/tree/master/other/SO-26695749
Edit:
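The problem was not with the code. It was caused by some strange behavior between the org.apache.hadoop.mapred and org.apache.hadoop.mapreduce packages. (See: Is it better to use the mapred or the mapreduce package to create a Hadoop Job?)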
Try adding an if statement before collecting the output in reduce.
if (sum >= 2)
    output.collect(key, new IntWritable(sum));
You can do the filtering in the Reduce1 class:
if (sum >= 2) {
    output.collect(key, new IntWritable(sum));
}
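To check the effect of a reduce-side filter without a cluster, the logic can be simulated in plain Java. The following is only a rough sketch and not Hadoop code: the class name ThresholdWordCount and the methods shuffle, reduce and run are made up for illustration, and a TreeMap stands in for Hadoop's grouping and sorting of keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java stand-in for the word count job; no Hadoop classes are used.
// All names here are hypothetical and exist only for this illustration.
class ThresholdWordCount {

    // Map + shuffle: emit a 1 for every token and group the values by word.
    // The TreeMap keeps the keys sorted, as the reducer would see them.
    static Map<String, List<Integer>> shuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                grouped.computeIfAbsent(tokenizer.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    // Reduce: sum the values of each key and emit the pair only if the
    // sum reaches the threshold (same test as "if (sum >= 2)" above).
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped, int threshold) {
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            if (sum >= threshold) {
                output.put(entry.getKey(), sum);
            }
        }
        return output;
    }

    // Convenience wrapper that runs both phases.
    static Map<String, Integer> run(List<String> lines, int threshold) {
        return reduce(shuffle(lines), threshold);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ant bee cat", "bee cat dog", "cat dog");
        System.out.println(run(input, 2)); // prints {bee=2, cat=3, dog=2}
    }
}
```

For the sample input and a threshold of 2, bee also passes the filter because it occurs in the first two lines. Filtering has to happen in the reducer, since only there is the complete sum for a key known.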