How to get topic probabilities from the LdaModel using gensim?
import gensim

# `tokens` is assumed to be a gensim Dictionary built from `texts`
data1 = [tokens.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus=data1, id2word=tokens,
    num_topics=10, random_state=100, update_every=1, chunksize=10,
    passes=10, alpha='auto', per_word_topics=True)
print(*ldamodel.print_topics(), sep="\n")
lda = ldamodel[data1]
l = [ldamodel.get_document_topics(item) for item in data1]
print(l)
While executing get_document_topics(), it gives an output of hundreds of lines (as shown in the picture). I don't know what it means. I actually want the probabilities of the topics. Which method should I use to get the topic probabilities?
Those are the topic probabilities. Your line...
l=[ldamodel.get_document_topics(item) for item in data1]
...essentially says, "give me a list, where each entry in that list is the topic probabilities for the same entry in data1".
So, the very first item in that returned list...
[(0, 0.974673)]
...means that your very-first document is assigned a 97.4673% chance of being in topic #0.
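To make that structure concrete, here is a minimal sketch (assuming the ldamodel and data1 from the question are in scope) that prints each document's topic probabilities in a readable form:

# Walk every document's (topic_id, probability) pairs
for doc_id, bow in enumerate(data1):
    for topic_id, prob in ldamodel.get_document_topics(bow):
        print(f"doc {doc_id}: topic {topic_id} -> {prob:.4f}")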
If you instead want the probabilities for a single document, say the document in slot 6, you'd instead run:
doc_6_topics = ldamodel.get_document_topics(data1[6])
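Note that get_document_topics() omits topics whose probability falls below a small threshold (the model's minimum_probability, 0.01 by default), which is why many documents show only one or two pairs. If you want the full distribution over all 10 topics, you can pass minimum_probability=0:

# All 10 topics, including near-zero ones
doc_6_all_topics = ldamodel.get_document_topics(data1[6], minimum_probability=0)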
So your existing code is already reporting the per-doc topic probabilities. If your true need is, "How do I get these into another format for another purpose?", you should edit/expand your question with more details about why the existing return value doesn't meet your needs, what would meet your needs, and what you're trying to do next.
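For instance, if the "other format" you have in mind is one fixed-width row of probabilities per document, a minimal sketch (again assuming the ldamodel and data1 from the question) would be:

import numpy as np

# One row per document, one column per topic; topics that
# get_document_topics() omits stay at 0.0.
doc_topic = np.zeros((len(data1), ldamodel.num_topics))
for i, bow in enumerate(data1):
    for topic_id, prob in ldamodel.get_document_topics(bow):
        doc_topic[i, topic_id] = prob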
Separate notes:
It'd be better to share the raw formatted text of what you're seeing than screenshots - see some reasons here
It's a little concerning that in the excerpt of output shown, your early documents all wind up in topic #0. If in fact your training data is "clumpy", with all related documents in a row, it can be helpful to shuffle them before model training, so that documents of any particular topic might appear anywhere, instead of "all at the front" or "all at the back".
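A minimal sketch of that shuffling, assuming texts is the list of tokenized documents from the question, done before building the corpus and training:

import random

random.seed(100)        # optional, for reproducibility
random.shuffle(texts)   # shuffle documents so related ones aren't all adjacent
data1 = [tokens.doc2bow(text) for text in texts]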