What is the most efficient way of removing stop words from a huge text corpus?

I want to know an efficient way to remove stop words from a huge text corpus. Currently, my approach is to convert the stop words into a regex, match each line of text against it, and remove the matches.

For example:

String regex = "\\b(?:a|an|the|was|i)\\b\\s*";
String line = "hi this is regex approach of stop word removal";
String lineWithoutStopword = line.replaceAll(regex, "");

Is there any other efficient approach for removing stop words from a huge corpus?

Thanks.

Using Spark, one way would be to subtract the stop words from the text after it has been tokenized into words.

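// Load the corpus and the stop-word list (one stop word per line) as RDDs.
val text = sc.textFile("huge.txt")
val stopWords = sc.textFile("stopwords.txt")
// Tokenize on runs of non-word characters and drop empty tokens.
val words = text.flatMap(line => line.split("\\W+")).filter(_.nonEmpty)
// Keep only the words that do not appear in the stop-word RDD.
val clean = words.subtract(stopWords)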

If you need to process very large text files (many gigabytes), it is more efficient to treat the set of stop words as an in-memory structure that can be broadcast to each worker. Each worker can then filter its own partitions locally, instead of shuffling the words across the cluster as subtract does.

The code would change like this:

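val stopWords = sc.textFile("stopwords.txt")
// The stop-word list is small, so collect it to the driver as a Set.
val stopWordSet = stopWords.collect.toSet
// Ship one read-only copy of the set to each executor.
val stopWordSetBC = sc.broadcast(stopWordSet)
// text is the corpus RDD loaded in the previous snippet.
val words = text.flatMap(line => line.split("\\W+")).filter(_.nonEmpty)
val clean = words.mapPartitions { iter =>
    // Look up the broadcast value once per partition, not once per word.
    val stopWordSet = stopWordSetBC.value
    iter.filter(word => !stopWordSet.contains(word))
}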
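
Since the question is about Java, the same in-memory idea can be sketched without Spark: tokenize each line and test every token against a HashSet, which replaces the regex alternation with a constant-time lookup. This is only a minimal sketch. The class name StopWordFilter, the method removeStopWords, and the five sample stop words are assumptions for illustration, not part of the answer above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class, used only to illustrate the set-based filter.
class StopWordFilter {
    // Assumed sample list; a real stop-word list would be loaded from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "was", "i"));

    // Splits a line on runs of non-word characters and keeps every token
    // that is not in the stop-word set.
    static List<String> removeStopWords(String line, Set<String> stopWords) {
        return Arrays.stream(line.split("\\W+"))
                .filter(token -> !token.isEmpty())           // split can yield a leading empty token
                .filter(token -> !stopWords.contains(token)) // hash lookup per token
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String line = "the cat was on a mat";
        // Prints: cat on mat
        System.out.println(String.join(" ", removeStopWords(line, STOP_WORDS)));
    }
}
```

Note that Java's \W treats only ASCII letters, digits, and underscore as word characters by default, so text in other scripts would need a different tokenizer.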

Note that the words of the original text must be normalized for this to work properly: for example, lowercased, so that "The" at the start of a sentence matches the stop word "the".
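
One possible way to handle that normalization, sketched in plain Java: lowercase each token only for the lookup, so the surviving words keep their original casing. The class NormalizingStopWordFilter, its method names, and the sample stop words are hypothetical, and lowercasing is just one assumed normalization; stemming or accent stripping may also be needed depending on the corpus.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class, used only to illustrate normalization before lookup.
class NormalizingStopWordFilter {
    // Assumed sample list, stored in lowercase.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "was", "i"));

    // Normalizes a token to the form used in the stop-word set.
    // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Keeps tokens whose normalized form is not a stop word;
    // the returned tokens retain their original casing.
    static List<String> removeStopWords(String line, Set<String> stopWords) {
        return Arrays.stream(line.split("\\W+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !stopWords.contains(normalize(token)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String line = "The Cat WAS on A mat.";
        // Prints: Cat on mat
        System.out.println(String.join(" ", removeStopWords(line, STOP_WORDS)));
    }
}
```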

