How can I initialize a gensim LDA topic model?

It has been suggested that initializing a topic model using clusters of words can lead to higher quality models or more robust (consistent) inference. I am talking about initializing the optimizer, not setting a prior. Here is some code to illustrate what I want to do:

Create an LdaModel object, but don't pass in a corpus.

lda_model = LdaModel(
         id2word=id2word,
         num_topics=30,
         eval_every=10,
         passes=40,
         iterations=5000)

Next, assign some property of the object, corresponding to the probabilities of drawing each word from a topic, to a matrix of my own construction.

lda_model.topics = my_topic_mat

Then fit the corpus:

lda_model.update(corpus)

Thanks for the help!

In practice, setting a prior may be a better choice than initializing the optimizer.

There are two hyperparameters, alpha and eta, where alpha is a prior for the document-topic matrix and eta is a prior for the topic-word matrix. To influence word probabilities in topics, try passing eta as an additional argument:

lda_model = gensim.models.ldamodel.LdaModel(num_topics=30, id2word=id2word, eta=your_topic_mat, 
                                            eval_every=10, iterations=5000)

From the gensim docs:

eta can be a scalar for a symmetric prior over topic/word distributions, or a vector of shape num_words, which can be used to impose (user defined) asymmetric priors over the word distribution. It also supports the special value 'auto', which learns an asymmetric prior over words directly from your data. eta can also be a matrix of shape num_topics x num_words, which can be used to impose asymmetric priors over the word distribution on a per-topic basis (can not be learned from data).
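As a sketch of how such a per-topic eta matrix might be built, the snippet below starts from a small symmetric prior and boosts the prior mass of seed words for chosen topics. The vocabulary size, seed word ids, and boost value are illustrative assumptions, not from the original question:

```python
import numpy as np

# Hypothetical dimensions -- in practice num_words = len(id2word).
num_topics = 30
num_words = 5000

# Start from a small symmetric prior over all topic-word pairs...
eta = np.full((num_topics, num_words), 0.01)

# ...then raise the prior for seed words of specific topics.
# seed_words maps topic index -> word ids (hypothetical ids here).
seed_words = {0: [10, 42, 137], 1: [7, 99]}
for topic_id, word_ids in seed_words.items():
    eta[topic_id, word_ids] = 1.0

# eta now has shape (num_topics, num_words) and can be passed
# as the eta argument of LdaModel, as in the snippet above.
print(eta.shape)
```

During inference the seeded words then start with more prior mass in their topics, nudging the optimizer toward the intended clusters without hard-coding the final word probabilities.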
