Remove an entire sentence containing a specific word with MapReduce

I am learning MapReduce and I want to read an input file (sentence by sentence) and write each sentence to an output file only if it does not contain the word "snake".

Example input file:

This is my first sentence. This is my first sentence.
This is my first sentence.

The snake is an animal. This is the second sentence. This is my third sentence.

Another sentence. Another sentence with snake.

Then the output file should be:

This is my first sentence. This is my first sentence.
This is my first sentence.

This is the second sentence. This is my third sentence.

Another sentence.

To do so, I check within the map method whether the sentence (value) contains the word snake. If it does not, I write that sentence to the context.

Additionally, I set the number of reducer tasks to 0; otherwise the sentences appear in the output file in random order (e.g. the first sentence, then the third, then the second, and so on).

My code properly filters out the sentences containing the word snake, but the problem is that it writes each sentence on a new line, like this:

This is my first sentence. 
 This is my first sentence. 

This is my first sentence. 
 This is the second sentence. 
 This is my third sentence. 


Another sentence. 

. 

How can I write a sentence on a new line only if that sentence starts on a new line in the input text? The following is my code:

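import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoveSentence {

    public static class SentenceMapper extends Mapper<Object, Text, Text, NullWritable>{

        // Sentences containing this word are dropped.
        private Text removeWord = new Text ("snake");

        // Called once per record; with the "." delimiter set in main, each record is one sentence.
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            if (!value.toString().contains(removeWord.toString())) {
                // Re-append the full stop that the record delimiter consumed.
                Text currentSentence = new Text(value.toString()+". ");
                context.write(currentSentence, NullWritable.get());
            }
        }
    }


    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split the input into records at every "." instead of at every newline.
        conf.set("textinputformat.record.delimiter", ".");

        Job job = Job.getInstance(conf, "remove sentence");
        job.setJarByClass(RemoveSentence.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setMapperClass(SentenceMapper.class);
        // Map-only job: no shuffle or sort, so records keep their input order.
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}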

This and this other solution say that calling context.write(word, null); should be sufficient, but it did not work in my case.

One more problem is related to conf.set("textinputformat.record.delimiter", ".");. This is how I define the delimiter between sentences, and because of it a sentence in the output file sometimes starts with a white space (e.g. the second This is my first sentence.). As an alternative, I tried conf.set("textinputformat.record.delimiter", ". "); (with a space after the full stop), but then the Java app does not write all the sentences to the output file.

You are very close to solving the problem. Think about how your MapReduce program works. Because you changed the record delimiter to "." (the default is a newline, as you know), your map method receives every single sentence as a separate value, and each value you write to the context ends up on its own line in the output file. You would need a property that disables writing a newline after every map() call. I am not sure, but I don't think such a property exists.
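To make the effect concrete, here is a minimal plain-Java sketch that imitates what the job does, without Hadoop. It is only an approximation under stated assumptions: the class and method names (DelimiterDemo, simulate) are made up for illustration, String.split stands in for the "." record delimiter, and an appended '\n' stands in for the newline the text output writes after each record.

```java
// Hypothetical stand-in for the job in the question; not Hadoop code.
final class DelimiterDemo {

    // Splits the input at every ".", drops records containing `word`,
    // and writes each surviving record followed by ". " and a newline,
    // which is what the mapper plus the text output effectively do.
    static String simulate(String input, String word) {
        StringBuilder out = new StringBuilder();
        for (String record : input.split("\\.")) {
            if (!record.contains(word)) {
                // The record still carries whatever followed the previous
                // full stop: a leading space or the original line break.
                out.append(record).append(". ").append('\n');
            }
        }
        return out.toString();
    }
}
```

Feeding it a small input such as "A. B.\nC.\n" shows both symptoms from the question: the second record starts with the space that followed the first full stop, and the trailing line break becomes a record of its own that is written out as a lone ". ".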

One workaround would be to keep the default newline delimiter, so that each input line arrives in map() as one record. An example record would be:

This is first sentence. This is second snake. This is last.

Find the word "snake" and, if it is found, remove everything from just after the previous "." up to and including the next ".". Package what remains into a new String and write it to the context.
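The filtering step of this workaround could look roughly like the sketch below. It is plain Java with no Hadoop dependency, and the names SentenceFilter and removeSentencesContaining are hypothetical. It assumes sentences end with "." and normalizes the spacing between the kept sentences to a single space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the workaround; not part of the original code.
final class SentenceFilter {

    // Returns `line` with every sentence that contains `word` removed.
    // Like the contains() check in the question, this is a substring
    // match, so "snake" also matches "snakes".
    static String removeSentencesContaining(String line, String word) {
        List<String> kept = new ArrayList<>();
        for (String sentence : line.split("\\.")) {
            String trimmed = sentence.trim();
            // Skip empty fragments (e.g. from a blank line) and matches.
            if (!trimmed.isEmpty() && !trimmed.contains(word)) {
                kept.add(trimmed + ".");
            }
        }
        return String.join(" ", kept);
    }
}
```

Inside map(), with the record delimiter left at its default, the result of removeSentencesContaining(value.toString(), "snake") could then be written with context.write(new Text(filtered), NullWritable.get()). Since each input line produces exactly one output line, the line structure of the input is preserved; a blank input line yields an empty string and therefore a blank output line.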

Of course, if you can find a way to disable the newline written after each map() call, that would be the easiest solution.

Hope this helps.
