I am new to MapReduce and I wanted to ask if someone can give me an idea of how to compute word length frequency using MapReduce. I already have the code for word count, but I want to use word length. This is what I've got so far.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
}
Thanks ...
For word length frequency, tokenizer.nextToken() shouldn't be emitted as the key; the length of that string should be emitted instead. So your code will work fine with just the following change:

word.set(String.valueOf(tokenizer.nextToken().length()));
Now if you take a deeper look, you will realize that the mapper output key should no longer be Text, although that does work. It is better to use an IntWritable key instead (remember to update the job configuration, e.g. job.setMapOutputKeyClass(IntWritable.class), to match):
public static class Map extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private IntWritable wordLength = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // emit (length of token, 1) instead of (token, 1)
            wordLength.set(tokenizer.nextToken().length());
            context.write(wordLength, one);
        }
    }
}
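To see what this mapper-plus-sum-reducer pipeline produces, here is a minimal plain-Java sketch (no Hadoop required) that mimics the same logic locally: tokenize a line, map each token to its length, and sum the ones per length. The class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LengthFrequencyDemo {
    // Simulates map (token -> (length, 1)) followed by a sum reducer.
    public static Map<Integer, Integer> countLengths(String line) {
        Map<Integer, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            int length = tokenizer.nextToken().length();
            counts.merge(length, 1, Integer::sum); // the "reduce" step
        }
        return counts;
    }

    public static void main(String[] args) {
        // "to be or not to be": five 2-letter words, one 3-letter word
        System.out.println(countLengths("to be or not to be")); // {2=5, 3=1}
    }
}
```

The real job would use Hadoop's stock IntSumReducer (or an equivalent) to do the per-key summing across the cluster.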
Although most MapReduce examples use StringTokenizer, it is cleaner and advisable to use the String.split method. Make the changes accordingly.
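As a quick illustration of that swap, here is the same tokenization done with String.split. One gotcha worth noting: without a trim(), leading whitespace produces an empty first token, so guard against that.

```java
public class SplitExample {
    public static void main(String[] args) {
        String line = "  hello   world  ";
        // trim() first, so split does not yield a leading empty token;
        // "\\s+" collapses runs of whitespace into a single separator
        String[] tokens = line.trim().split("\\s+");
        for (String token : tokens) {
            System.out.println(token + " -> " + token.length());
        }
    }
}
```

In the mapper, the while loop over the tokenizer simply becomes a for-each loop over this array, with wordLength.set(token.length()) inside.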