
What is the most efficient way of removing stop words from a huge text corpus?

I want to know an efficient way to remove stop words from a huge text corpus. Currently my approach is to convert the stop words into a regex, match the lines of text against the regex, and remove the matches.

For example:

String regex = "\\b(?:a|an|the|was|i)\\b\\s*";
String line = "hi this is regex approach of stop word removal";
String lineWithoutStopword = line.replaceAll(regex, "");
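
One caveat with this snippet: String.replaceAll recompiles the regular expression on every call, so on a huge corpus it is cheaper to compile the pattern once and reuse it. A minimal sketch of that idea (written in Scala to match the answer below; removeStopWords is just an illustrative helper name):

import java.util.regex.Pattern

// Compile the stop word regex once, instead of once per input line.
val stopWordPattern = Pattern.compile("\\b(?:a|an|the|was|i)\\b\\s*")

def removeStopWords(line: String): String =
  stopWordPattern.matcher(line).replaceAll("")

// e.g. removeStopWords("hi this is the regex approach") yields "hi this is regex approach"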

Is there any other, more efficient approach to removing stop words from a huge corpus?

Thanks.

Using Spark, one way would be to subtract the stop words from the text after it has been tokenized into words.

val text = sc.textFile("huge.txt")
val stopWords = sc.textFile("stopwords.txt")
// Split on runs of non-word characters so the split yields no empty tokens.
val words = text.flatMap(line => line.split("\\W+"))
// subtract removes every word that also appears in the stop word RDD.
val clean = words.subtract(stopWords)

If you need to process very large files of text (tens of GBs and up), it will be more efficient to treat the set of stop words as an in-memory structure that can be broadcast to each worker: subtract shuffles the entire words RDD across the cluster, whereas a broadcast set lets each partition be filtered locally.

The code would change like this:

val stopWords = sc.textFile("stopwords.txt")
// Collect the (small) stop word list to the driver as a Set ...
val stopWordSet = stopWords.collect.toSet
// ... and ship a single read-only copy to every executor.
val stopWordSetBC = sc.broadcast(stopWordSet)
val words = text.flatMap(line => line.split("\\W+"))
val clean = words.mapPartitions { iter =>
  // Resolve the broadcast value once per partition, then filter locally.
  val stopWordSet = stopWordSetBC.value
  iter.filter(word => !stopWordSet.contains(word))
}
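
Both snippets above produce a flat RDD of the surviving words, so the original line structure is lost. If you need to keep lines intact, the same broadcast set can drive a per-line rewrite instead (a sketch reusing the variable names from above):

val cleanLines = text.map { line =>
  val stopWordSet = stopWordSetBC.value
  // Rebuild each line from the words that survive the filter.
  line.split("\\W+").filter(word => !stopWordSet.contains(word)).mkString(" ")
}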

Note that normalizing the words of the original text so they match the entries in the stop word list (for example, lowercasing them) will be necessary for this to work properly.
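
A minimal sketch of such a normalization pass, assuming the stop word list itself is lowercase: lowercase each token and drop any empty strings the split produces.

val normalizedWords = text
  .flatMap(line => line.split("\\W+"))
  .map(_.toLowerCase)
  .filter(_.nonEmpty)
val cleanNormalized = normalizedWords.mapPartitions { iter =>
  val stopWordSet = stopWordSetBC.value
  iter.filter(word => !stopWordSet.contains(word))
}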
