Calculate coherence for non-gensim topic model

I've built a topic model, with:

  • Input: a list of tokenized lists (one list of tokens per document)
  • Output: an m x t matrix (each cell gives the probability of word i appearing in topic k)
  • Output: a k x n matrix (each cell gives the probability of topic k in document j)

To find the optimal number of topics, I want to calculate the coherence of a model. However, I am only aware of Gensim's CoherenceModel, which seems to require a Gensim model as input.

Are there any other packages or implementations that I could use to calculate the coherence of an already computed topic model? Or, if it is indeed possible to use CoherenceModel without passing in an LDA model, could someone show me how to do that?

Actually, you can do this with the Gensim package. CoherenceModel accepts the top words of each topic directly, so no Gensim model is needed.

input_data = list of lists of tokens (one tokenized text per inner list)

topics = list of lists holding the top N words of each topic
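If your model only exposes the word-by-topic probability matrix described in the question, the topics list can be built by picking the highest-probability words in each column. This is a minimal sketch, assuming the matrix rows follow the order of a vocabulary list; `top_words_per_topic`, `word_topic` and `vocab` are hypothetical names for illustration, not part of Gensim.

```python
def top_words_per_topic(word_topic, vocab, top_n=10):
    """Return the top_n most probable words for each topic.

    word_topic: m x t nested sequence, word_topic[i][k] = P(word i in topic k)
    vocab: list of m words, in the same order as the rows of word_topic
    """
    n_topics = len(word_topic[0])
    topics = []
    for k in range(n_topics):
        # Rank word indices by their probability in topic k, highest first.
        ranked = sorted(range(len(vocab)),
                        key=lambda i: word_topic[i][k],
                        reverse=True)
        topics.append([vocab[i] for i in ranked[:top_n]])
    return topics


# Toy data (assumed): 4 words, 2 topics.
vocab = ["cat", "dog", "stock", "market"]
word_topic = [
    [0.50, 0.05],
    [0.40, 0.05],
    [0.05, 0.50],
    [0.05, 0.40],
]
topics = top_words_per_topic(word_topic, vocab, top_n=2)
print(topics)  # [['cat', 'dog'], ['stock', 'market']]
```

Every word in topics should also occur in input_data, since the Gensim dictionary is built from those texts.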

import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Map each token to an integer id and build the bag-of-words corpus.
id2word = corpora.Dictionary(input_data)
corpus = [id2word.doc2bow(text) for text in input_data]

# No trained Gensim model is passed; the top words per topic are enough.
cm = CoherenceModel(topics=topics, texts=input_data, corpus=corpus,
                    dictionary=id2word, coherence='c_v')
coherence = cm.get_coherence()
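If you would rather not depend on Gensim at all, a document co-occurrence score in the style of UMass coherence is short enough to write by hand. The sketch below is an illustration under stated assumptions, not Gensim's implementation: it uses +1 smoothing and averages over word pairs, so its values are not guaranteed to match CoherenceModel with coherence='u_mass'. The function name `umass_coherence` and the toy texts are made up for the example.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, texts):
    """Average UMass-style score over all word pairs of one topic.

    topic_words: top words of the topic, most probable first
    texts: list of tokenized documents (list of lists of tokens)
    """
    docs = [set(doc) for doc in texts]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in docs if all(w in d for w in words))

    scores = []
    # combinations() keeps list order, so `earlier` is ranked above `later`.
    for earlier, later in combinations(topic_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # word never occurs in the corpus; skip the pair
        scores.append(math.log((doc_freq(earlier, later) + 1) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy corpus (assumed) to compare a coherent and an incoherent topic.
texts = [
    ["cat", "dog"],
    ["cat", "dog", "pet"],
    ["stock", "market"],
    ["stock", "market", "dog"],
]
good = umass_coherence(["cat", "dog"], texts)
mixed = umass_coherence(["cat", "stock"], texts)
print(good > mixed)  # True
```

To compare candidate numbers of topics, compute this score for every topic of each model, average the scores per model, and prefer the model with the higher average.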
