MapReduce find word length frequency

I am new to MapReduce and would like some ideas on how to compute word length frequency with it. I already have the code for word count, but I want to count by word length instead. This is what I've got so far:

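import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every whitespace-separated token in the line.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer and driver not shown.
}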

Thanks ...

For word length frequency, tokenizer.nextToken() should not be emitted as the key; the length of that string should be used instead. Your code will work with just the following change:

word.set(String.valueOf(tokenizer.nextToken().length()));

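On closer inspection, the Mapper output key no longer needs to be Text, even though that works. Text keys are compared as strings, so "10" sorts before "2" in the output, whereas IntWritable keys sort numerically and avoid the string conversion. If you change the key type, update the reducer's input key type and the job's output key class to match. It is better to use an IntWritable key instead: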

public static class Map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);
        }
    }
}
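
To check what the job should produce before running it on a cluster, the same logic can be simulated in plain Java. This is only a rough local sketch, not Hadoop code: the class name WordLengthFrequencyLocal and the method countLengths are made up for illustration, and a TreeMap stands in for the shuffle and a summing reducer.

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the job; names are illustrative only.
class WordLengthFrequencyLocal {

    // Mimics map + shuffle + reduce: each token contributes (length, 1),
    // and the counts for equal lengths are summed.
    static Map<Integer, Integer> countLengths(Iterable<String> lines) {
        // TreeMap keeps keys in numeric order, as IntWritable keys would be.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                int length = tokenizer.nextToken().length();
                counts.merge(length, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = countLengths(
                List.of("the quick brown fox", "jumps over the lazy dog"));
        // One "length<TAB>count" pair per line.
        counts.forEach((length, count) -> System.out.println(length + "\t" + count));
    }
}
```

For the two sample lines, lengths 3, 4 and 5 occur 4, 2 and 3 times respectively, which is what a summing reducer should output for the same input.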

Although most MapReduce examples use StringTokenizer, its Javadoc describes it as a legacy class whose use is discouraged in new code, so String.split is the cleaner choice. Consider changing the tokenizing loop accordingly.
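
A minimal sketch of the split-based tokenizing, kept free of Hadoop types so it can be run on its own. The class SplitTokenLengths and the method tokenLengths are hypothetical names, and splitting on "\\s+" is an assumption that treats any run of whitespace as a separator, which is close to (but not exactly) StringTokenizer's default delimiters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names are illustrative only.
class SplitTokenLengths {

    // Returns the length of every whitespace-separated token in the line.
    static List<Integer> tokenLengths(String line) {
        List<Integer> lengths = new ArrayList<>();
        // trim() prevents a leading empty token; a blank line has no tokens.
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return lengths;
        }
        for (String token : trimmed.split("\\s+")) {
            lengths.add(token.length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        // In the mapper, each length would be passed to wordLength.set(...)
        // followed by context.write(wordLength, one).
        System.out.println(tokenLengths("  hello   big world "));
    }
}
```

The explicit blank-line check matters because "".split("\\s+") returns a single empty string, which would otherwise be counted as a word of length 0.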
