
How to get topic probability from the LdaModel by using gensim?

import gensim
from gensim import corpora

# texts is a list of tokenized documents; tokens is a gensim Dictionary built from them
tokens = corpora.Dictionary(texts)
data1 = [tokens.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus=data1, id2word=tokens, num_topics=10,
                                           random_state=100, update_every=1, chunksize=10,
                                           passes=10, alpha='auto', per_word_topics=True)
print(*ldamodel.print_topics(), sep="\n")
lda = ldamodel[data1]
l = [ldamodel.get_document_topics(item) for item in data1]
print(l)

While executing get_document_topics(), it gives an output of hundreds of lines (as shown in the picture). I don't know what it means. I actually want the probabilities of topics. Which method should I use to get the topic probabilities?

[Screenshot: output of get_document_topics()]

Those are the topic probabilities. Your line...

l=[ldamodel.get_document_topics(item) for item in data1]

...essentially says, "give me a list, where each entry in that list is the topic-probabilities for the corresponding entry in data1".

So, the very first item in that returned list...

[(0, 0.974673)]

...means that your very first document is assigned a 97.4673% chance of being in topic #0.
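If you want every topic's probability listed for a document, including the near-zero ones that gensim normally hides, one option is to lower the minimum_probability cutoff. A minimal sketch, reusing ldamodel and data1 from your code:

# Report all 10 topics for the first document, even those with tiny probability.
all_probs = ldamodel.get_document_topics(data1[0], minimum_probability=0.0)
for topic_id, prob in all_probs:
    print(f"topic {topic_id}: {prob:.4f}")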

If you instead want the probabilities for a single document, say the document in slot 6, you'd instead run:

doc_6_topics = ldamodel.get_document_topics(data1[6])

So your existing code is already reporting the per-doc topic probabilities. If your true need is, "How do I get these into another format for another purpose?", you should edit/expand your question with more details about why the existing return value doesn't meet your needs, what would meet them, and what you're trying to do next.
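For example, if the "other format" you have in mind is a dense document-by-topic matrix, here's a rough sketch of one way to build it; it assumes numpy is available and reuses ldamodel and data1 from your code:

import numpy as np

# One row per document, one column per topic, filled with probabilities.
doc_topic = np.zeros((len(data1), ldamodel.num_topics))
for i, bow in enumerate(data1):
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[i, topic_id] = prob
print(doc_topic[0])  # topic probabilities for the first document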

Separate notes:

  • It'd be better to share the raw formatted text of what you're seeing rather than screenshots - see some reasons here

  • It's a little concerning that in the excerpt of output shown, your early documents all wind up in topic #0. If in fact your training data is "clumpy", with all related documents in a row, it can be helpful to shuffle them before model training, so that documents of any particular topic might appear anywhere, instead of "all at the front" or "all at the back" (a rough sketch follows below).
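A rough sketch of that shuffling step, assuming texts is your list of tokenized documents and tokens is the Dictionary built from them:

import random

random.seed(100)       # fix the seed so the shuffle is reproducible
random.shuffle(texts)  # break up any "clumps" of related documents
data1 = [tokens.doc2bow(text) for text in texts]
# ...then train the LdaModel on this shuffled data1 as before.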
