Python implementation of LDA topic model with Gibbs sampling and burnin + thin options?
I am trying to optimize an LDA topic model using collapsed Gibbs sampling. I have been using the ldatuning package in R to optimize the number of topics k:
controls_tm <- list(
burnin = 1000,
iter = 4000,
thin = 500,
nstart = 5,
seed = 0:4,
best = TRUE
)
num_cores <- max(parallel::detectCores() - 1, 1)
result <- FindTopicsNumber(
  my_dfm,
  topics = seq(40, 100, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  mc.cores = num_cores,
  control = controls_tm,
  verbose = TRUE
)
This is all fine. Now I can run topicmodels in R for a given k with the same controls, but it takes ~8 hours per model, even on an HPC cluster with 27 cores. I am looking for a Python implementation of LDA that I can run with the same controls, so that it is consistent with what I used to optimize with ldatuning, but faster, because I need to run multiple models to compare perplexity.
I have looked at the lda library in Python, which uses Gibbs sampling and takes under an hour per model. But as far as I can tell, I cannot pass it the burnin or thin parameters.
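For context, burnin and thin only determine which sampler iterations contribute to the final estimate: the first burnin iterations are discarded, and of the remainder every thin-th draw is kept. The hypothetical helper below (my own illustration, not part of lda, topicmodels, or any other library, and using one common convention for counting iterations) makes the selection rule concrete:

```python
def kept_iterations(n_iter, burnin, thin):
    """Return the 0-indexed sampler iterations whose draws are kept:
    everything before `burnin` is discarded, then every `thin`-th draw is saved."""
    return [it for it in range(n_iter)
            if it >= burnin and (it - burnin) % thin == 0]

# With the controls from the question (iter = 4000, burnin = 1000, thin = 500),
# six draws are averaged:
print(kept_iterations(4000, 1000, 500))
# → [1000, 1500, 2000, 2500, 3000, 3500]
```

Note that exact conventions differ between implementations (e.g. whether burnin iterations count toward iter), so this is a sketch of the idea rather than a specification of what topicmodels does internally.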
I have also looked at sklearn.decomposition.LatentDirichletAllocation, but it uses variational Bayes instead of Gibbs sampling, and it does not appear to accept burnin or thin either. The same seems true of gensim (I think -- I am not very familiar with it).
Does this just not exist in Python? Or is there a workaround so that I can run a model in Python with Gibbs sampling and the parameters I want? Thanks!
Assuming you won't use online training, have you checked gensim's Latent Dirichlet Allocation via Mallet?
This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents, using an (optimized version of) collapsed Gibbs sampling from MALLET.
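If you specifically need explicit burnin and thin controls, another workaround is to implement the collapsed Gibbs sampler yourself. The sketch below is my own minimal, unoptimized pure-Python illustration (not code from any of the libraries discussed above, and far slower than MALLET's sampler), but it shows exactly where burnin and thin enter the procedure:

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, burnin=100, thin=10, seed=0):
    """Collapsed Gibbs sampler for LDA with burn-in and thinning.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (phi, theta): topic-word and doc-topic probability estimates,
    averaged over every `thin`-th sweep after the first `burnin` sweeps.
    """
    rng = random.Random(seed)
    n_docs = len(docs)
    # Count tables maintained incrementally by the sampler.
    ndk = [[0] * n_topics for _ in range(n_docs)]      # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of each token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)                # random initialization
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)

    phi_sum = [[0.0] * vocab_size for _ in range(n_topics)]
    theta_sum = [[0.0] * n_topics for _ in range(n_docs)]
    n_samples = 0

    for it in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional p(z = t | everything else), unnormalized.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta)
                           for t in range(n_topics)]
                # Draw the new topic proportional to the weights.
                r = rng.random() * sum(weights)
                k = n_topics - 1  # fallback for floating-point edge cases
                acc = 0.0
                for t in range(n_topics):
                    acc += weights[t]
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        # Burn-in and thinning: keep every `thin`-th sweep after `burnin`.
        if it >= burnin and (it - burnin) % thin == 0:
            n_samples += 1
            for t in range(n_topics):
                tot = nk[t] + vocab_size * beta
                for w in range(vocab_size):
                    phi_sum[t][w] += (nkw[t][w] + beta) / tot
            for d in range(n_docs):
                tot = len(docs[d]) + n_topics * alpha
                for t in range(n_topics):
                    theta_sum[d][t] += (ndk[d][t] + alpha) / tot

    phi = [[v / n_samples for v in row] for row in phi_sum]
    theta = [[v / n_samples for v in row] for row in theta_sum]
    return phi, theta
```

For corpora of any real size you would want this vectorized (or a compiled sampler like MALLET's), but the control flow is the same: sweep the sampler, discard the burn-in, and average the thinned draws.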