Python implementation of LDA topic model with Gibbs sampling and burnin + thin options?

I am trying to optimize an LDA topic model using collapsed Gibbs sampling. I have been using the ldatuning package in R to optimize the number of topics k:

library(ldatuning)

# Gibbs sampling controls shared across all candidate models
controls_tm <- list(
  burnin = 1000,
  iter = 4000,
  thin = 500,
  nstart = 5,
  seed = 0:4,
  best = TRUE
)

# leave one core free
num_cores <- max(parallel::detectCores() - 1, 1)

result <- FindTopicsNumber(
  my_dfm,
  topics = seq(40, 100, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  mc.cores = num_cores,
  control = controls_tm,
  verbose = TRUE
)

This is all fine. Now I can run topicmodels in R for a given k with the same controls, but it takes ~8 hours per model, even on an HPC cluster with 27 cores. I am looking for a Python implementation of LDA that I can run with the same controls, so that it is consistent with what I used to tune the number of topics with ldatuning, but faster, because I need to run multiple models to compare perplexity.

I have looked at the lda library in Python, which uses Gibbs sampling and takes <1 hour per model. But as far as I can tell, I cannot pass it burnin or thin parameters.
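For reference, a minimal sketch of how the lda library is typically called (the document-term matrix X and the choice of k are placeholders, not from the original post); the constructor exposes n_iter and a refresh logging interval, but nothing corresponding to burnin or thin:

import lda

# X: a document-term count matrix (e.g. scipy.sparse or numpy int array) -- placeholder
model = lda.LDA(
    n_topics=60,     # one k value from the grid above, for illustration
    n_iter=4000,     # total Gibbs iterations
    alpha=0.1,
    eta=0.01,
    random_state=0,
    refresh=500      # logging interval only; NOT a burn-in or thinning control
)
model.fit(X)
doc_topic = model.doc_topic_     # per-document topic distributions (final sample only)
topic_word = model.topic_word_   # per-topic word distributions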

I have also looked at sklearn.decomposition.LatentDirichletAllocation, but it uses variational Bayes instead of Gibbs sampling, and it doesn't look like it accepts burnin or thin anyway. Same goes for gensim (I think -- I am not very familiar with it).
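To illustrate the point, a hedged sketch of the scikit-learn interface (X is again a placeholder document-term matrix): its knobs control variational inference (learning_method, max_iter), not Gibbs burn-in or thinning:

from sklearn.decomposition import LatentDirichletAllocation

# X: document-term count matrix (placeholder)
lda_vb = LatentDirichletAllocation(
    n_components=60,          # number of topics
    max_iter=20,              # variational EM iterations, not Gibbs sweeps
    learning_method="batch",  # or "online"; there is no Gibbs option
    random_state=0,
)
doc_topic = lda_vb.fit_transform(X)   # per-document topic proportions
print(lda_vb.perplexity(X))           # perplexity on X, for model comparison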

Does this just not exist in Python? Or is there a workaround so that I can run a model in Python with Gibbs sampling and the parameters I want? Thanks!

Assuming you won't use online training, have you checked gensim's Latent Dirichlet Allocation via Mallet?

This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents, using an (optimized version of) collapsed Gibbs sampling from MALLET.
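A rough sketch of what that wrapper call might look like (not the answerer's code): it assumes gensim < 4.0, where gensim.models.wrappers.LdaMallet still ships, a local MALLET install at a placeholder path, and a placeholder list of tokenized documents. Note that iterations sets the total number of Gibbs sweeps and optimize_interval controls MALLET's hyperparameter optimization; there is no direct equivalent of topicmodels' burnin/thin arguments:

from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0

# tokenized_docs: list of token lists (placeholder)
mallet_path = "/path/to/mallet-2.0.8/bin/mallet"   # placeholder path to the MALLET binary
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

model = LdaMallet(
    mallet_path,
    corpus=corpus,
    id2word=dictionary,
    num_topics=60,          # one k value, for illustration
    iterations=4000,        # total Gibbs sweeps; no separate burnin/thin arguments
    optimize_interval=0,    # MALLET hyperparameter optimization (0 = off), not thinning
    workers=4,
)
print(model.show_topics(num_topics=5, num_words=10))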
