Get how many times a word appears in a text file and link it to a text file

I currently have 3 text files with the data

Textfile1
Hello World
Bye World

Textfile2
Hello World
Hello Second

How do I get a result of

Hello {Textfile1 = 1, Textfile2 = 2}
World {Textfile1 = 2, Textfile2 = 1}

So far I have managed to pass the words from my Map class into my Reduce class, and this is where I am stuck at the moment.

public class Reduce extends Reducer<Text, Text, Text, Text> {
    HashMap<Text, Integer>input = new HashMap<Text, Integer>();

    public void reduce(Text key, Iterable<Text> values , Context context)
    throws IOException, InterruptedException {
        int sum = 0;
        for(Text val: values){
            String word = key.toString();
            Text filename;
            input.put(val,sum );
                if(//not sure what to write here){

               }
            }
       context.write(new Text(key), input);
}

My mapper code

public class Map extends Mapper<LongWritable, Text, Text, Text> {

private Text file = new Text();
private Text word = new Text();
private String pattern= "^[a-z][a-z0-9]*$";// a lowercase letter followed by lowercase letters or digits

public void map(LongWritable key, Text value,Context context) throws IOException, InterruptedException {

    InputSplit inputSplit = context.getInputSplit();
    String fileName = ((FileSplit)inputSplit).getPath().getName();
    file.set(fileName);
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
       word.set(tokenizer.nextToken());

        String stringWord = word.toString().toLowerCase();
        if (stringWord.matches(pattern)){
            context.write(new Text(stringWord), new Text(fileName));

        }
    }
}

}

Hope I can get some help

In the mapper output, we can set the text file name as the key and each row of the file as the value.

The file name can be retrieved in the Mapper class with the snippet below.

FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();

Then in the reducer

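import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // key = file name, values = the rows of that file.
        // A fresh map per key keeps the counts of one file out of the next.
        HashMap<String, Integer> input = new HashMap<String, Integer>();
        for (Text val : values) {
            String row = val.toString(); // processing each row
            String[] wordarray = row.split("\\s+"); // split on whitespace
            for (int i = 0; i < wordarray.length; i++) {
                if (wordarray[i].isEmpty()) {
                    continue; // a row with leading whitespace yields an empty token
                }
                if (input.get(wordarray[i]) == null) {
                    input.put(wordarray[i], 1);
                } else {
                    int value = input.get(wordarray[i]) + 1;
                    input.put(wordarray[i], value);
                }
            }
        }
        // Emits one line per file: the file name and its word counts.
        context.write(new Text(key), new Text(input.toString()));
    }
}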
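If the word is kept as the key instead, as in the question's mapper (word as key, file name as value), the reduce step only has to tally how often each file name appears for that word. Below is a minimal plain-Java sketch of that tally, with no Hadoop classes involved; the class name PerFileTally and the method tally are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the per-word tally; names here are hypothetical.
final class PerFileTally {

    // Counts how often each file name occurs among the values of one word.
    // A TreeMap keeps the file names sorted, so the printed order is stable.
    static Map<String, Integer> tally(Iterable<String> fileNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String fileName : fileNames) {
            counts.merge(fileName, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Values the question's mapper would emit for "hello" with the two sample files.
        List<String> values = Arrays.asList("Textfile1", "Textfile2", "Textfile2");
        System.out.println("hello " + tally(values)); // prints "hello {Textfile1=1, Textfile2=2}"
    }
}
```

Inside a real reducer the loop would run over the Iterable<Text>, call val.toString() on each value, and finish with context.write(key, new Text(counts.toString())). Converting to String matters because Hadoop reuses the same Text object while iterating, so putting val itself into a HashMap, as the question's code does, will not give separate keys.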

You could write a custom Writable class for the map output key, something like a text pair that holds the (filename, word) combination, with 1 as the value.

Map Output

<K,V> ==> <MytextpairWritable,new IntWritable(1)>
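A rough sketch of what such a pair key needs: serialization of both fields, an ordering for the sort phase, and equals/hashCode so that equal keys land in the same partition. The class below is plain Java so it runs without Hadoop on the classpath; the name FileWordPair is hypothetical, and in an actual job the class would additionally declare that it implements org.apache.hadoop.io.WritableComparable, whose write, readFields and compareTo methods have the same shape as the ones shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Objects;

// Hypothetical (filename, word) key, written without Hadoop dependencies.
final class FileWordPair implements Comparable<FileWordPair> {
    private String fileName = "";
    private String word = "";

    FileWordPair() { } // no-arg constructor, needed when a key is deserialized

    FileWordPair(String fileName, String word) {
        this.fileName = fileName;
        this.word = word;
    }

    // Serialize both fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeUTF(word);
    }

    // Read the fields back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        word = in.readUTF();
    }

    // Sort by file name first, then by word.
    @Override
    public int compareTo(FileWordPair other) {
        int byFile = fileName.compareTo(other.fileName);
        return byFile != 0 ? byFile : word.compareTo(other.word);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FileWordPair)) return false;
        FileWordPair p = (FileWordPair) o;
        return fileName.equals(p.fileName) && word.equals(p.word);
    }

    // Hash partitioning relies on hashCode(), so equal keys must hash equally.
    @Override
    public int hashCode() {
        return Objects.hash(fileName, word);
    }

    @Override
    public String toString() {
        return fileName + "," + word;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a key through its serialized form.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new FileWordPair("File1", "hello").write(new DataOutputStream(bytes));
        FileWordPair copy = new FileWordPair();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy); // prints "File1,hello"
    }
}
```

In the mapper, each token would then be emitted as such a key with new IntWritable(1) as the value.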

Then just sum the values on the reducer side and emit the total, something like this:

public class Reduce extends Reducer<MytextpairWritable, IntWritable, MytextpairWritable, IntWritable> {

    @Override
    public void reduce(MytextpairWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up the 1s the mapper emitted for this (filename, word) key.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

This would give you something like

File1,hello,2
File2,hello,3
File3,hello,1
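To get from rows like these to the word-first layout asked for in the question, the rows can be regrouped by word in a small post-processing step. The sketch below is plain Java and assumes each row is exactly "file,word,count" separated by commas, as in the sample above; the real job output may use a different separator (the default text output puts a tab between key and value). The class name RegroupByWord is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing helper; not part of the MapReduce job itself.
final class RegroupByWord {

    // Turns "file,word,count" rows into word -> {file=count}.
    static Map<String, Map<String, Integer>> regroup(List<String> rows) {
        Map<String, Map<String, Integer>> byWord = new TreeMap<>();
        for (String row : rows) {
            String[] parts = row.split(",");
            String file = parts[0];
            String word = parts[1];
            int count = Integer.parseInt(parts[2].trim());
            // Create the inner map on first sight of a word, then record the file's count.
            byWord.computeIfAbsent(word, w -> new TreeMap<>()).put(file, count);
        }
        return byWord;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("File1,hello,2", "File2,hello,3", "File3,hello,1");
        for (Map.Entry<String, Map<String, Integer>> e : regroup(rows).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue()); // prints "hello {File1=2, File2=3, File3=1}"
        }
    }
}
```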
