Gensim Topic Modeling with Mallet Perplexity
I am topic modelling Harvard Library book titles and subjects.
I use the Gensim Mallet wrapper to model with Mallet's LDA. When I try to get coherence and perplexity values to see how good the model is, the perplexity calculation fails with the warnings shown below. I do not get the same error if I use Gensim's built-in LDA model instead of Mallet. My corpus holds 7M+ documents of up to 50 words each, averaging 20, so the documents are short.
Below is the relevant part of my code:
# TOPIC MODELING
import gensim
from gensim.models import CoherenceModel
num_topics = 50
# Build Gensim's LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))
# a measure of how good the model is; lower is better.
Perplexity: -47.91929228302663
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model,
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Coherence Score: 0.28852857563541856
Gensim's LDA gave scores without problems. Now I model the same bag of words with MALLET:
# Building LDA Mallet Model
mallet_path = '~/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path,
corpus=corpus, num_topics=num_topics, id2word=id2word)
# Convert mallet to gensim type
mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=mallet_model,
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
Coherence Score: 0.5994123896865993
Then I ask for the perplexity value and get the warnings below and a NaN value.
# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))
/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1108: RuntimeWarning: invalid value encountered in multiply
  score += np.sum((self.eta - _lambda) * Elogbeta)
Perplexity: nan
/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1109: RuntimeWarning: invalid value encountered in subtract
  score += np.sum(gammaln(_lambda) - gammaln(self.eta))
I realize this is a very Gensim-specific question and requires deeper knowledge of this function: gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)
Hence I would appreciate any comments on the warnings and on how Gensim handles this conversion.
I do not think that the perplexity function is implemented for the Mallet wrapper. As mentioned in Radim's answer, the perplexity is printed to stdout:
"AFAIR, Mallet displays the perplexity to stdout -- would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper."
I just ran it on a sample corpus, and the LL/token was indeed printed every so many iterations:
LL/token: -9.45493
perplexity = 2^(-LL/token) = 701.81
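To capture these values programmatically, a minimal sketch could scan Mallet's captured output for the LL/token lines and apply the formula above. The function name `perplexities_from_log`, the sample log text, and the regular expression are my own illustrative assumptions, not part of Gensim or Mallet. The base of 2 simply follows the formula above; it is worth checking which logarithm base your Mallet version reports before relying on the numbers.

```python
import re

# Matches lines such as "<100> LL/token: -9.45493" (format assumed from the output above).
LL_PATTERN = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)")


def perplexities_from_log(text, base=2.0):
    """Return one perplexity per LL/token line found in `text`.

    `base` is an assumption: 2.0 follows the formula quoted above.
    """
    return [base ** (-float(value)) for value in LL_PATTERN.findall(text)]


# Illustrative log excerpt; in practice this would be Mallet's captured stdout.
sample_log = "<100> LL/token: -9.45493\n"
values = perplexities_from_log(sample_log)
print(round(values[-1], 2))
```

The last element of the returned list corresponds to the most recent LL/token line, which is usually the one of interest after training finishes.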
A few cents from me.
It seems that in lda_model.log_perplexity(corpus) you use the same corpus you used for training. You might have better luck with a held-out/test set of the corpus.
lda_model.log_perplexity(corpus) does not return the perplexity itself; it returns a per-word likelihood bound. If you want to turn it into a perplexity, do np.exp2(-bound). I was struggling with this for some time :)
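The two points above can be sketched in plain Python, assuming the corpus is a list of bag-of-words documents. `split_corpus` and `bound_to_perplexity` are hypothetical helper names, and the toy corpus is made up for illustration; `2.0 ** (-bound)` is the same computation as `np.exp2(-bound)`.

```python
import random


def split_corpus(corpus, held_out_fraction=0.1, seed=100):
    """Shuffle the documents and return (train_docs, held_out_docs)."""
    docs = list(corpus)  # note: materialises the whole corpus in memory
    random.Random(seed).shuffle(docs)
    n_test = max(1, int(len(docs) * held_out_fraction))
    return docs[n_test:], docs[:n_test]


def bound_to_perplexity(bound):
    """Convert the per-word bound from log_perplexity into a perplexity."""
    return 2.0 ** (-bound)


# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [[(0, 1), (1, 2)], [(1, 1)], [(2, 3)], [(0, 2), (2, 1)], [(3, 1)]]
train_docs, test_docs = split_corpus(toy_corpus, held_out_fraction=0.2)
print(len(train_docs), len(test_docs))

# The value printed in the question is a bound, not a perplexity.
question_bound = -47.91929228302663
perplexity = bound_to_perplexity(question_bound)
print(perplexity)
```

With a real model you would train on `train_docs` and then call `bound_to_perplexity(lda_model.log_perplexity(test_docs))`. For a corpus of 7M documents, a streaming split would be kinder to memory than `list(corpus)`.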