What is the most efficient way to remove stop words from a huge text corpus?

Question

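I want to know an efficient way to remove stop words from a huge text corpus. Currently, my approach is to convert the stop words into a regular expression, match each line of text against that regex, and remove the matches.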

For example:

String regex = "\\b(?:a|an|the|was|i)\\b\\s*";
String line = "hi this is regex approach of stop word removal";
String lineWithoutStopword = line.replaceAll(regex, "");

Is there any other efficient way to remove stop words from a huge corpus?

Thanks

Answer 1

Using Spark, one way is to tokenize the text into words and then subtract the stop words from it.

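// sc is the SparkContext (predefined in spark-shell)
val text = sc.textFile("huge.txt")
val stopWords = sc.textFile("stopwords.txt")
// Tokenize each line on non-word characters
val words = text.flatMap(line => line.split("\\W"))
// Drop every word that also appears in the stop-word RDD
val clean = words.subtract(stopWords)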

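If you need to process very large text files (>> GBs), it is more efficient to treat the set of stop words as an in-memory structure that can be broadcast to each worker.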

The code would change like this:

val stopWords = sc.textFile("stopwords.txt")
// Collect the (small) stop-word list to the driver and broadcast it as a Set
val stopWordSet = stopWords.collect.toSet
val stopWordSetBC = sc.broadcast(stopWordSet)
val words = text.flatMap(line => line.split("\\W"))
val clean = words.mapPartitions{iter =>
    // Read the broadcast value once per partition
    val stopWordSet = stopWordSetBC.value
    iter.filter(word => !stopWordSet.contains(word))
}

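Note that for this to work properly, the words in the original text need to be normalized (for example, lowercased) so that they match the entries in the stop-word list.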
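To make the normalization point concrete, here is a minimal sketch in plain Java (no Spark), in the language of the question. The class name `StopWordFilter` and the method `removeStopWords` are hypothetical, and lowercasing with `Locale.ROOT` is only one possible normalization, assumed here for illustration. It applies the same idea as the broadcast version: a set lookup per token instead of a regex pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of any library.
class StopWordFilter {

    // Tokenizes a line, lowercases each token, and keeps the tokens
    // that are not in the stop-word set (assumed to be lowercase already).
    static List<String> removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        // Split on runs of non-word characters.
        for (String token : line.split("\\W+")) {
            String normalized = token.toLowerCase(Locale.ROOT);
            // split can yield an empty first token when the line starts with punctuation
            if (!normalized.isEmpty() && !stopWords.contains(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "an", "the", "was", "i"));
        // "The" is only removed because it is lowercased before the lookup.
        System.out.println(removeStopWords("The cat was on a mat", stopWords));
        // prints [cat, on, mat]
    }
}
```

Without the `toLowerCase` call, "The" would not match the entry "the" and would remain in the output, which is the kind of mismatch the note above warns about.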

Solution 1
4 2015-04-11 07:12:51
