What is the most efficient way to remove stop words from a huge text corpus?

Question

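I would like to know an efficient way to remove stop words from a huge text corpus. Currently, my approach is to turn the stop words into a regular expression, match each line of text against it, and delete the matches.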

For example:

String regex = "\\b(?:a|an|the|was|i)\\b\\s*";
String line = "hi this is regex approach of stop word removal";
String lineWithoutStopword = line.replaceAll(regex, "");

Is there a more efficient way to remove stop words from a huge corpus?

Thanks

Answer 1

With Spark, one approach is to tokenize the text into words and then subtract the stop words from the result.

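val text = sc.textFile("huge.txt")
val stopWords = sc.textFile("stopwords.txt")
// Tokenize each line on non-word characters
val words = text.flatMap(line => line.split("\\W"))
// Keep only the words that do not appear in the stop word RDD
val clean = words.subtract(stopWords)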

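If you need to process very large text files (>> GBs), it is more efficient to treat the set of stop words as an in-memory structure that can be broadcast to each worker.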

The code would change like this:

val stopWords = sc.textFile("stopwords.txt")
// Collect the stop words on the driver and broadcast them as a Set
val stopWordSet = stopWords.collect.toSet
val stopWordSetBC = sc.broadcast(stopWordSet)
val words = text.flatMap(line => line.split("\\W"))
val clean = words.mapPartitions{iter =>
    // Read the broadcast value once per partition
    val stopWordSet = stopWordSetBC.value
    iter.filter(word => !stopWordSet.contains(word))
}

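Note that for this to work properly, the words in the original text need to be normalized (for example, lowercased) so that they match the stop word entries exactly.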
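The set-lookup idea behind the broadcast version can also be tried without Spark. Below is a minimal sketch in plain Java, the language used in the question, which lowercases each token before checking it against a hash set. The class name `StopWordFilter`, the method `removeStopWords`, and the five-word stop list (taken from the regex in the question) are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Hypothetical stop word list for illustration; a real one would be loaded from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "was", "i"));

    // Splits a line on runs of non-word characters, lowercases each token,
    // and keeps only the tokens that are not in the stop word set.
    static String removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : line.split("\\W+")) {
            String normalized = token.toLowerCase(Locale.ROOT);
            // split() can yield an empty first token when the line starts with a separator
            if (!normalized.isEmpty() && !stopWords.contains(normalized)) {
                kept.add(normalized);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Prints: cat on mat
        System.out.println(removeStopWords("The cat was on a mat", STOP_WORDS));
    }
}
```

Each token costs one hash lookup regardless of how many stop words there are, which is the same property the broadcast `Set` relies on. Note that this sketch returns the lowercased tokens, so the original casing is not preserved.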

Solution 1
4 2015-04-11 07:12:51
