
Add stop words in Gensim

Thanks for stopping by! I had a quick question about appending stop words. There are a select few words that show up in my data set, and I was hoping I could add them to gensim's stop word list. I've seen a lot of examples using nltk, and I was hoping there would be a way to do the same in gensim. I'll post my code below:

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            nltk.bigrams(token)
            result.append(lemmatize_stemming(token))
    return result

While gensim.parsing.preprocessing.STOPWORDS is pre-defined for your convenience, and happens to be a frozenset so it can't be directly added to, you could easily make a larger set that includes both those words and your additions. For example:

from gensim.parsing.preprocessing import STOPWORDS
my_stop_words = STOPWORDS.union(set(['mystopword1', 'mystopword2']))
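Since frozenset.union() returns a new frozenset and leaves the original untouched, here is a quick self-contained sanity check of that behaviour (a small sketch; 'mystopword1' and 'mystopword2' are just placeholder words):

from gensim.parsing.preprocessing import STOPWORDS

my_stop_words = STOPWORDS.union(set(['mystopword1', 'mystopword2']))

print(type(my_stop_words))              # <class 'frozenset'>
print('mystopword1' in my_stop_words)   # True
print('mystopword1' in STOPWORDS)       # False: the built-in set is unchanged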

Then use the new, larger my_stop_words in your subsequent stop-word-removal code. (The simple_preprocess() function of gensim doesn't automatically remove stop-words.)

Alternatively, you can list the extra words directly inside preprocess() and check them in the same condition:

def preprocess(text):
    result = []
    newStopWords = ['stopword1', 'stopword2']
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            nltk.bigrams(token)
            result.append(lemmatize_stemming(token))
    return result
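For completeness, a minimal sketch of using the merged my_stop_words set from the earlier snippet together with simple_preprocess(). The extra words and the sample sentence are illustrative placeholders, and the lemmatize_stemming() step from the question is left out for brevity:

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

# merge gensim's built-in stop words with your own additions (placeholder words here)
my_stop_words = STOPWORDS.union(set(['customword', 'anotherword']))

def preprocess(text):
    # simple_preprocess() only tokenizes and lowercases; the stop words are filtered out here
    return [token for token in simple_preprocess(text)
            if token not in my_stop_words and len(token) > 3]

# 'customword' is dropped along with gensim's built-in stop words; in the original code,
# lemmatize_stemming() would then be applied to each surviving token
print(preprocess("The customword should disappear while longer content tokens remain"))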
