Hadoop WordCount for multiple words not getting the public variables
I have a simple Hadoop program that I need to implement for a paper at my university. It is an alternative WordCount problem: it should build combined Text() strings of n words each, and the reducer should only sum up those strings that occur >= k times. I read the integers n and k from the command line, after the input and output folders (args[2] and args[3]). The problem is that n and k are empty when used in the mapper and reducer, even though their values are captured correctly from the command line. The code is below; what is wrong?
public class MultiWordCount {

    public static int n;
    public static int k;

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private StringBuilder phrase = new StringBuilder();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                for (int i = 0; i < n; i++) {
                    if (itr.hasMoreTokens()) {
                        phrase.append(itr.nextToken());
                        phrase.append(" ");
                    }
                }
                word.set(phrase.toString());
                context.write(word, one);
                phrase.setLength(0);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            if (sum >= k) {
                result.set(sum);
                context.write(key, result);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        n = Integer.parseInt(args[2]);
        k = Integer.parseInt(args[3]);
        Job job = Job.getInstance(conf, "multi word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Although your Java-based logic looks fine here, the Map and Reduce functions in Hadoop are more short-sighted, or rather more self-contained, than one might expect. To be more precise, you declare public static variables in the parent class and initialize them in the driver/main function, but the mapper/reducer instances have no access to the driver at all; they only see their own strict scopes inside the TokenizerMapper and IntSumReducer classes. On a cluster, each map/reduce task typically runs in its own JVM, where the class is loaded fresh and main() never executes, so n and k still hold their default value of 0 when you look at them from the mappers and reducers.
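A minimal sketch of the failure mode (the class and field names here are illustrative, not taken from your program): main() assigns the static field in the driver JVM only, while each task JVM loads the class anew with the field at its default.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class StaticPitfall {

    public static int n; // assigned below, but only in the driver JVM

    public static class DemoMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // In a task JVM the class is loaded fresh and main() never ran,
            // so n still holds its Java default: 0.
            System.out.println("n = " + n); // prints "n = 0" on the cluster
        }
    }

    public static void main(String[] args) {
        n = Integer.parseInt(args[0]); // visible only in this (driver) JVM
    }
}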
Since your program has just one job and runs within a single Hadoop Configuration, there is no need for Hadoop counters here. You can simply declare Configuration-based values and, through the setup methods of the TokenizerMapper and IntSumReducer classes, have each mapper and reducer read them before its map or reduce function executes.
To declare these kinds of values and pass them down to the MapReduce functions, you can do the following in your driver/main method:
conf.set("n", args[2]);
and then access the value in the setup methods of TokenizerMapper and IntSumReducer, converting it from String to int along the way:
n = Integer.parseInt(context.getConfiguration().get("n"));
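As a small variant, Configuration also offers typed accessors, which push the String handling into Hadoop itself and fail fast in the driver if the argument is not a number. A sketch using the same "n" and "k" keys:

// In the driver: setInt stores the value as an int property.
conf.setInt("n", Integer.parseInt(args[2]));
conf.setInt("k", Integer.parseInt(args[3]));

// In setup(): getInt takes a default that is used when the key is absent.
n = context.getConfiguration().getInt("n", 1);
k = context.getConfiguration().getInt("k", 1);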
So the program can look like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class MultiWordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private StringBuilder phrase = new StringBuilder();
        private int n;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Runs once per task, before any map() call: read n from the job Configuration.
            n = Integer.parseInt(context.getConfiguration().get("n"));
        }

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // Build a phrase of (up to) n consecutive tokens.
                for (int i = 0; i < n; i++) {
                    if (itr.hasMoreTokens()) {
                        phrase.append(itr.nextToken());
                        phrase.append(" ");
                    }
                }
                word.set(phrase.toString());
                context.write(word, one);
                phrase.setLength(0);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();
        private int k;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Runs once per task, before any reduce() call: read k from the job Configuration.
            k = Integer.parseInt(context.getConfiguration().get("k"));
        }

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values)
                sum += val.get();
            // Emit only the phrases that occur at least k times.
            if (sum >= k) {
                result.set(sum);
                context.write(key, result);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("n", args[2]);
        conf.set("k", args[3]);

        // Delete the output directory if it already exists, so the job can be rerun.
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(args[1])))
            fs.delete(new Path(args[1]), true);

        Job job = Job.getInstance(conf, "Multi Word Count");
        job.setJarByClass(MultiWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // No combiner is set here: reusing IntSumReducer as a combiner would apply the
        // sum >= k filter to per-mapper partial counts and could wrongly discard phrases
        // whose global count is actually >= k.
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
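Assuming the class is packaged into a jar (the jar name and HDFS paths below are placeholders), the job takes the input directory, the output directory, n, and k as its four arguments:

hadoop jar multiwordcount.jar MultiWordCount /user/me/input /user/me/output 3 1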
For n=3 and k=1, run against a text file of Franz Kafka sentences, the output is a list of 3-word phrases, each followed, tab-separated, by its occurrence count.