
How to remove a word in LDA analysis with gensim

I'm using gensim for LDA topic modeling. My data was preprocessed by someone else, who gave me two things:

① the mmcorpus file (loaded with gensim.corpora.MmCorpus)
② the dictionary file (loaded with gensim.corpora.Dictionary.load)

I created the LDA model successfully, varied the hyperparameter alpha between 0.5 and 1.5, and drew a chart like this:

[image]

I was confused about why there are several tall bars. I also found some strange words like this:

[image]

Interestingly, the letter "b" appears, which I hadn't seen before. The person who gave me the data said the "b" may have been generated automatically when he converted the data to the bytes type. He doesn't know how to remove the "b", and neither do I. How can I delete the "b" when I only have the mmcorpus file and the dictionary file? Please!

gensim has a function for filtering specific tokens out of the dictionary (filter_tokens). You just have to know their corresponding IDs. As for the corpus, I am not aware of any built-in function that lets you modify its content. You can, however, convert the (usually sparse) corpus to a dense numpy array, delete a column, and convert it back to a gensim corpus. After that, you should be able to use both the modified dictionary and the modified corpus to train a new LDA model, this time without the unwanted words. Here is my shot at it with a small toy corpus:

import gensim
import numpy as np

# toy document set
texts = ['This is my first b', 'Another b just like so']
tokenlist = [list(gensim.utils.tokenize(text)) for text in texts]

# create dictionary and MmCorpus
dictionary = gensim.corpora.Dictionary(tokenlist)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenlist]
gensim.corpora.MmCorpus.serialize('MmCorpusTest.mm', corpus)

# assume the word 'b' is to be deleted, put its id in a variable
del_ids = [k for k,v in dictionary.items() if v=='b']

# remove unwanted word ids from the dictionary in place
dictionary.filter_tokens(bad_ids=del_ids)

# load corpus from your file
corpusMm = gensim.corpora.MmCorpus('MmCorpusTest.mm')
# convert corpus to a dense array, transpose because by default documents would be columns
np_corpus = gensim.matutils.corpus2dense(corpusMm, corpusMm.num_terms, num_docs=corpusMm.num_docs).T
# delete columns for specified tokens, transpose back afterwards
np_corpus = np.delete(np_corpus, del_ids, 1).T
# convert array to corpus
new_corpus = gensim.matutils.Dense2Corpus(np_corpus)
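
A dense array can get large when the vocabulary is big, so here is a rough alternative sketch that filters the bag-of-words documents directly. It is plain Python with no gensim calls: the two dicts stand in for dictionary.token2id before and after filter_tokens, and the helper names build_id_map and remap_corpus, as well as the toy ids, are made up for illustration.

```python
def build_id_map(old_token2id, new_token2id):
    """Map the old id of every surviving token to its new id."""
    return {old_token2id[token]: new_id for token, new_id in new_token2id.items()}


def remap_corpus(corpus, id_map):
    """Yield each document without the dropped ids, with kept ids renumbered."""
    for doc in corpus:
        yield sorted(
            (id_map[token_id], count)
            for token_id, count in doc
            if token_id in id_map  # ids missing from the map were deleted
        )


# Hypothetical stand-ins for dictionary.token2id before and after removing 'b'.
old_token2id = {'This': 0, 'b': 1, 'first': 2, 'is': 3, 'my': 4}
new_token2id = {'This': 0, 'first': 1, 'is': 2, 'my': 3}

# Toy bag-of-words corpus: one list of (token_id, count) pairs per document.
corpus = [[(0, 1), (1, 1), (2, 1)], [(1, 2), (4, 1)]]

id_map = build_id_map(old_token2id, new_token2id)
new_corpus = list(remap_corpus(corpus, id_map))
print(new_corpus)
```

With real data, you would copy dict(dictionary.token2id) before calling filter_tokens, use dictionary.token2id afterwards as the second argument, and pass the MmCorpus as corpus. The resulting list should be writable with gensim.corpora.MmCorpus.serialize, though I have only tried the idea on toy input.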
