
How to filter out words in a corpus from a constrained vocabulary with gensim?

I am using gensim for topic modeling. I've created a corpus using:

from gensim import corpora

wordDict = corpora.Dictionary(trimmedTextTokens)

gsCorpus = [wordDict.doc2bow(text) for text in trimmedTextTokens]

where trimmedTextTokens is the result of removing stop words. Now I want to filter out the terms in the corpus that are not in a restricted (constructed) vocabulary list. Any ideas? Thank you!!

Assuming your restricted vocabulary list is in a variable named restrictedVocabularyList, you could do:

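# A set makes the membership test fast
restrictedVocabulary = set(restrictedVocabularyList)

wordDict = corpora.Dictionary(trimmedTextTokens)

# Filter the tokens inside each document; testing `text in restrictedVocabularyList` would compare a whole token list against the vocabulary
gsCorpus = [wordDict.doc2bow([word for word in text if word in restrictedVocabulary]) for text in trimmedTextTokens]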
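A variation on the same idea, as a minimal sketch: filter the tokens first, then build both the dictionary and the corpus from the filtered documents, so the dictionary itself only holds restricted-vocabulary terms. The helper name restrict_to_vocabulary and the sample data are hypothetical, made up for illustration; the gensim step is guarded so the filtering part runs even where gensim is not installed.

```python
def restrict_to_vocabulary(documents, vocabulary):
    """Keep only the tokens that appear in `vocabulary`, document by document."""
    allowed = set(vocabulary)  # set lookup avoids scanning a list per token
    return [[tok for tok in doc if tok in allowed] for doc in documents]


# Hypothetical sample data standing in for the question's variables
trimmedTextTokens = [
    ["human", "machine", "interface"],
    ["graph", "minors", "survey", "graph"],
]
restrictedVocabularyList = ["graph", "human", "survey"]

filteredTokens = restrict_to_vocabulary(trimmedTextTokens, restrictedVocabularyList)

try:
    from gensim import corpora
except ImportError:
    corpora = None  # gensim not available; only the filtering step runs

if corpora is not None:
    # The dictionary is built from the filtered tokens, so it contains
    # only words from the restricted vocabulary.
    wordDict = corpora.Dictionary(filteredTokens)
    gsCorpus = [wordDict.doc2bow(text) for text in filteredTokens]
```

Documents whose tokens are all outside the vocabulary end up as empty lists, and doc2bow turns those into empty bags of words; whether to drop such documents before training a topic model depends on the use case.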
