Gensim Topic Modeling with Mallet Perplexity

Question

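I am topic modeling Harvard Library book titles and subjects.

I use the Gensim Mallet wrapper to model with Mallet's LDA. When I try to get Coherence and Perplexity values to see how good the model is, perplexity fails to calculate and raises the warnings shown below. I do not get the same error if I use Gensim's built-in LDA model instead of Mallet. My corpus holds 7 million documents of up to 50 words each, averaging 20 words, so the documents are short.

Below is the relevant part of my code: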

# TOPIC MODELING

import gensim
from gensim.models import CoherenceModel
num_topics = 50

# Build Gensim's LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       update_every=1,
                                       chunksize=100,
                                       passes=10,
                                       alpha='auto',
                                       per_word_topics=True)

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
# a measure of how good the model is. lower the better.

Perplexity:  -47.91929228302663

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, 
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Coherence Score:  0.28852857563541856

Gensim's LDA gave the scores without any problem. Now I model the same bag of words with MALLET.

# Building LDA Mallet Model
mallet_path = '~/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, 
corpus=corpus, num_topics=num_topics, id2word=id2word)

# Convert mallet to gensim type
mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=mallet_model, 
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score:  0.5994123896865993

Then I ask for the Perplexity value and get the warnings below and a NaN value.

# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1108: RuntimeWarning: invalid value encountered in multiply
  score += np.sum((self.eta - _lambda) * Elogbeta)

Perplexity:  nan

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1109: RuntimeWarning: invalid value encountered in subtract
  score += np.sum(gammaln(_lambda) - gammaln(self.eta))

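I realize this is a very Gensim-specific question that requires deeper knowledge of this function: gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

So I would appreciate any comments on the warnings and on the Gensim side of things.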

Answer 1

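I do not think the perplexity function is implemented for the Mallet wrapper. As mentioned in Radim's answer, the perplexity is displayed on stdout:

AFAIR, Mallet displays the perplexity to stdout. Would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity eval too, but it's certainly not included in the wrapper.

I just ran it on a sample corpus, and the LL/token value is indeed printed every so many iterations:

LL/token: -9.45493

perplexity = 2^(-LL/token) = 701.81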
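As a rough sketch of capturing those values programmatically, the snippet below pulls LL/token values out of saved MALLET output and applies the formula above. The log excerpt is hypothetical: its line layout and the first two values are assumptions for illustration, and only the final value (-9.45493) comes from the run above. The names SAMPLE_LOG, ll_per_token_values and perplexity_from_ll are made up for this example and are not part of Gensim or MALLET.

```python
import re

# Hypothetical excerpt of MALLET training output. The line layout and the
# first two values are assumptions; only -9.45493 comes from the answer.
SAMPLE_LOG = """\
<10> LL/token: -9.81234
<20> LL/token: -9.60112
<30> LL/token: -9.45493
"""

# Matches "LL/token: <number>" anywhere in the text.
LL_PATTERN = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)")


def ll_per_token_values(log_text):
    """Return every LL/token value found in the log text, in order."""
    return [float(value) for value in LL_PATTERN.findall(log_text)]


def perplexity_from_ll(ll_per_token):
    """Apply the formula from the answer: perplexity = 2^(-LL/token)."""
    return 2.0 ** (-ll_per_token)


values = ll_per_token_values(SAMPLE_LOG)
final_perplexity = perplexity_from_ll(values[-1])
print(round(final_perplexity, 2))
```

The last reported value is used here on the assumption that it reflects the state of the model at the end of training.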

Answer 2

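My two cents.

It seems that in lda_model.log_perplexity(corpus) you use the same corpus you used for training. You might have better luck with a held-out/test set of the corpus.
lda_model.log_perplexity(corpus) does not return Perplexity. It returns the "bound". If you want to turn it into Perplexity, do np.exp2(-bound). I was struggling with this for some time :)
There is no way to report Perplexity with the Mallet wrapper, afaik.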
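
The first two points could be sketched as follows, using only the standard library so it runs without Gensim installed. The helper names perplexity_from_bound and split_corpus, the 10% held-out fraction, and the toy corpus are assumptions for illustration. 2.0 ** (-bound) computes the same value as np.exp2(-bound).

```python
import random


def perplexity_from_bound(bound):
    """Convert the per-word bound to perplexity; same value as np.exp2(-bound)."""
    return 2.0 ** (-bound)


def split_corpus(corpus, held_out_fraction=0.1, seed=100):
    """Shuffle a copy of the corpus and split it into (train, held_out)."""
    docs = list(corpus)
    random.Random(seed).shuffle(docs)
    n_held_out = max(1, int(len(docs) * held_out_fraction))
    return docs[n_held_out:], docs[:n_held_out]


# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [[(i, 1)] for i in range(20)]
train, held_out = split_corpus(toy_corpus)
print(len(train), len(held_out))

# Bound reported in the question for the Gensim LDA model.
print(perplexity_from_bound(-47.91929228302663))
```

With a real model, the idea would be to train on train and then pass held_out to lda_model.log_perplexity(...) before converting the returned bound.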
