How to calculate perplexity for LDA with Gibbs sampling
I am fitting an LDA topic model in R on a collection of 200+ documents (65k words total). The documents have been preprocessed and are stored in the document-term matrix dtm. In theory I expect to find 5 distinct topics in the corpus, but I would like to calculate the perplexity score and see how the model fit changes with the number of topics. Below is the code I use. The problem is that it gives me an error when I try to calculate the perplexity score, and I am not sure how to fix it (I am new to R). The error occurs on the last line of code. I would appreciate any help.
library(topicmodels)
burnin <- 4000  # number of burn-in iterations to discard
iter <- 2000    # number of iterations after burn-in
thin <- 500     # keep every 500th iteration to reduce correlation between samples
seed <- list(2003, 10, 100, 10005, 765)  # one seed per starting point
nstart <- 5     # use 5 different starting points
best <- TRUE    # return the run with the highest posterior probability
# Number of topics (run the algorithm for different values of k and choose by inspecting the results)
k <- 5
# Run LDA using Gibbs sampling
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
perplexity(ldaOut, newdata = dtm)
Error in method(x, k, control, model, mycall, ...) : Need 1 seeds
It needs one more argument, estimate_theta. Use the code below:
perplexity(ldaOut, newdata = dtm, estimate_theta = FALSE)
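To see how the fit changes with the number of topics, the same call can be repeated for several values of k. Below is a minimal sketch, not a tuned setup. It assumes the AssociatedPress data bundled with topicmodels as a stand-in for your dtm, and the candidate values in ks, the single seed, and the short burn-in and iteration counts are illustrative choices to keep the run quick.

```r
library(topicmodels)

# Assumption: a small slice of the bundled AssociatedPress data stands in for dtm.
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:50, ]

# Assumption: illustrative candidate numbers of topics.
ks <- c(2, 5, 10)

perp <- sapply(ks, function(k) {
  # Short Gibbs run with a single start and a single seed (illustrative settings).
  fit <- LDA(dtm, k = k, method = "Gibbs",
             control = list(seed = 2003, burnin = 100, iter = 200, thin = 50))
  # Reuse the topic distributions from the fitted model instead of re-estimating them.
  perplexity(fit, newdata = dtm, estimate_theta = FALSE)
})

# One row per candidate k with its perplexity on the training documents.
data.frame(k = ks, perplexity = perp)
```

Note that with estimate_theta = FALSE the documents in newdata have to be the same ones the model was fitted on, so this is an in-sample perplexity. For choosing k, a perplexity computed on held-out documents is usually more informative.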