
Gensim Topic Modeling with Mallet Perplexity

I am topic modelling Harvard Library book titles and subjects.

I use the Gensim Mallet wrapper to model with Mallet's LDA. When I try to get coherence and perplexity values to see how good the model is, the perplexity calculation fails with the warnings shown below. I do not get the same error if I use Gensim's built-in LDA model instead of Mallet. My corpus holds 7M+ documents of up to 50 words each, averaging 20, so the documents are short.

Below is the related part of my code:

# TOPIC MODELING

import gensim
from gensim.models import CoherenceModel
num_topics = 50

# Build Gensim's LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       update_every=1,
                                       chunksize=100,
                                       passes=10,
                                       alpha='auto',
                                       per_word_topics=True)

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
# a measure of how good the model is. lower the better.

Perplexity: -47.91929228302663

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model,
                                     texts=data_words_trigrams,
                                     dictionary=id2word,
                                     coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Coherence Score: 0.28852857563541856

Gensim's LDA gave both scores without any problem. Now I model the same bag of words with Mallet:

# Building LDA Mallet Model
mallet_path = '~/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path,
                                             corpus=corpus,
                                             num_topics=num_topics,
                                             id2word=id2word)

# Convert mallet to gensim type
mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=mallet_model,
                                           texts=data_words_trigrams,
                                           dictionary=id2word,
                                           coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score: 0.5994123896865993

Then I ask for the perplexity value and get the warnings below and a NaN result.

# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1108: RuntimeWarning: invalid value encountered in multiply
  score += np.sum((self.eta - _lambda) * Elogbeta)
/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1109: RuntimeWarning: invalid value encountered in subtract
  score += np.sum(gammaln(_lambda) - gammaln(self.eta))

Perplexity: nan

I realize this is a very Gensim-specific question and requires deeper knowledge of this function: gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

Hence I would appreciate any comments on the warnings and on this part of Gensim.

I do not think that the perplexity function is implemented for the Mallet wrapper. As mentioned in Radim's answer, Mallet prints the perplexity to stdout:

AFAIR, Mallet displays the perplexity to stdout -- would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.

I just ran it on a sample corpus, and the LL/token was indeed printed every so many iterations:

LL/token: -9.45493

perplexity = 2^(-LL/token) = 701.81
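
The conversion above can be sketched in Python, assuming Mallet's training output has been captured as text (for example, redirected to a log file). The function name `perplexities_from_log` and the regular expression are hypothetical helpers, not part of Gensim or Mallet, and the base-2 formula is simply the one quoted in this answer.

```python
import re

# Matches lines such as "LL/token: -9.45493" in captured Mallet output.
LL_PATTERN = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)")

def perplexities_from_log(log_text):
    """Return one perplexity value per LL/token line found in log_text."""
    return [2 ** (-float(value)) for value in LL_PATTERN.findall(log_text)]

# The only sample line used here is the one reported above.
sample_log = "LL/token: -9.45493\n"
values = perplexities_from_log(sample_log)
print(round(values[-1], 2))
```

The last value in the list corresponds to the most recent LL/token report, which is usually the one of interest once training has finished.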

A few cents from me.

  1. It seems that in lda_model.log_perplexity(corpus) you use the same corpus you used for training. You might have better luck with a held-out/test set of the corpus.
  2. lda_model.log_perplexity(corpus) doesn't return perplexity. It returns the "bound". If you want to turn it into perplexity, do np.exp2(-bound). I was struggling with this for some time :)
  3. There is no way to use the Mallet wrapper to report perplexity, AFAIK.
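
Points 1 and 2 can be sketched as follows. This is a minimal illustration, not the asker's pipeline: `split_corpus` and `bound_to_perplexity` are hypothetical helper names, the 10% test fraction is an arbitrary assumption, and the Gensim calls appear only in comments because they need a real corpus and dictionary. `2 ** (-bound)` is the plain-Python equivalent of `np.exp2(-bound)`.

```python
import random

def split_corpus(corpus, test_fraction=0.1, seed=100):
    """Shuffle document indices and split into (train, held_out) lists."""
    indices = list(range(len(corpus)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(corpus) * test_fraction))
    held_out = [corpus[i] for i in indices[:n_test]]
    train = [corpus[i] for i in indices[n_test:]]
    return train, held_out

def bound_to_perplexity(bound):
    """Convert the per-word bound from log_perplexity into perplexity."""
    return 2 ** (-bound)

# Intended use with Gensim (not run here; needs the real corpus and id2word):
#   train, held_out = split_corpus(corpus)
#   lda_model = gensim.models.ldamodel.LdaModel(corpus=train, id2word=id2word,
#                                               num_topics=num_topics)
#   print(bound_to_perplexity(lda_model.log_perplexity(held_out)))

# Toy bag-of-words corpus, only to show that the helpers run.
toy_corpus = [[(0, 1)], [(1, 2)], [(2, 1)], [(0, 3)], [(1, 1)],
              [(2, 2)], [(0, 1)], [(1, 4)], [(2, 1)], [(0, 2)]]
train, held_out = split_corpus(toy_corpus)
print(len(train), len(held_out))
print(bound_to_perplexity(-9.0))
```

For a corpus of several million documents, materializing both lists in memory may be impractical; splitting by index while streaming would be a reasonable alternative.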
