
How to get topic probability from the LdaModel by using gensim?

import gensim
from gensim import corpora

# texts is a list of tokenized documents; tokens is a gensim Dictionary built from them
tokens = corpora.Dictionary(texts)
data1 = [tokens.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus=data1, id2word=tokens, num_topics=10,
                                           random_state=100, update_every=1, chunksize=10,
                                           passes=10, alpha='auto', per_word_topics=True)
print(*ldamodel.print_topics(), sep="\n")
lda = ldamodel[data1]
l = [ldamodel.get_document_topics(item) for item in data1]
print(l)

While executing get_document_topics(), it gives an output of hundreds of lines (as shown in the picture). I don't know what it means. I actually want the probabilities of topics. Which method should I use to get the topic probabilities?

[Screenshot: output of get_document_topics()]

Those are the topic probabilities. Your line...

l=[ldamodel.get_document_topics(item) for item in data1]

...essentially says, "give me a list, where each entry in that list is the topic-probabilities for the corresponding entry in data1".

So, the very first item in that returned list...

[(0, 0.974673)]

...means that your very first document is assigned a 97.4673% chance of being in topic #0.
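If you want every topic's probability listed for a document, including the near-zero ones that gensim normally hides, one option is to lower the minimum_probability cutoff. A minimal sketch, reusing ldamodel and data1 from your code:

# Report all 10 topics for the first document, even those with tiny probability.
all_probs = ldamodel.get_document_topics(data1[0], minimum_probability=0.0)
for topic_id, prob in all_probs:
    print(f"topic {topic_id}: {prob:.4f}")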

If you instead want the probabilities for a single document, say the document in slot 6, you'd instead run:

doc_6_topics = ldamodel.get_document_topics(data1[6])

So your existing code is already reporting the per-doc topic probabilities. If your true need is, "How do I get these into another format for another purpose?", you should edit/expand your question with more details about why the existing return value doesn't meet your needs, what would meet them, and what you're trying to do next.
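For example, if the "other format" you have in mind is a dense document-by-topic matrix, here's a rough sketch of one way to build it; it assumes numpy is available and reuses ldamodel and data1 from your code:

import numpy as np

# One row per document, one column per topic, filled with probabilities.
doc_topic = np.zeros((len(data1), ldamodel.num_topics))
for i, bow in enumerate(data1):
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[i, topic_id] = prob
print(doc_topic[0])  # topic probabilities for the first document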

Separate notes:

  • It'd be better to share the raw formatted text of what you're seeing rather than screenshots - see some reasons here

  • It's a little concerning that in the excerpt of output shown, your early documents all wind up in topic #0. If in fact your training data is "clumpy", with all related documents in a row, it can be helpful to shuffle them before model training, so that documents of any particular topic might appear anywhere, instead of "all at the front" or "all at the back" (a rough sketch follows below).
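A rough sketch of that shuffling step, assuming texts is your list of tokenized documents and tokens is the Dictionary built from them:

import random

random.seed(100)       # fix the seed so the shuffle is reproducible
random.shuffle(texts)  # break up any "clumps" of related documents
data1 = [tokens.doc2bow(text) for text in texts]
# ...then train the LdaModel on this shuffled data1 as before.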
