Python implementation of LDA topic model with Gibbs sampling and burnin + thin options?

I am trying to optimize an LDA topic model using collapsed Gibbs sampling. I have been using the ldatuning package in R to optimize the number of topics k:

library(ldatuning)

# Gibbs sampling controls shared across all candidate models
controls_tm <- list(
  burnin = 1000,
  iter = 4000,
  thin = 500,
  nstart = 5,
  seed = 0:4,
  best = TRUE
)

# leave one core free
num_cores <- max(parallel::detectCores() - 1, 1)

result <- FindTopicsNumber(
  my_dfm,
  topics = seq(40, 100, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  mc.cores = num_cores,
  control = controls_tm,
  verbose = TRUE
)

This is all fine. Now I can run topicmodels in R for a given k with the same controls, but it takes ~8 hours per model, even on an HPC cluster with 27 cores. I am looking for a Python implementation of LDA that I can run with the same controls, so that it is consistent with what I used to tune the number of topics with ldatuning, but faster, because I need to run multiple models to compare perplexity.

I have looked at the lda library in Python, which uses Gibbs sampling and takes <1 hour per model. But as far as I can tell, I cannot pass it burnin or thin parameters.
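For reference, a minimal sketch of how the lda library is typically called (the document-term matrix X and the choice of k are placeholders, not from the original post); the constructor exposes n_iter and a refresh logging interval, but nothing corresponding to burnin or thin:

import lda

# X: a document-term count matrix (e.g. scipy.sparse or numpy int array) -- placeholder
model = lda.LDA(
    n_topics=60,     # one k value from the grid above, for illustration
    n_iter=4000,     # total Gibbs iterations
    alpha=0.1,
    eta=0.01,
    random_state=0,
    refresh=500      # logging interval only; NOT a burn-in or thinning control
)
model.fit(X)
doc_topic = model.doc_topic_     # per-document topic distributions (final sample only)
topic_word = model.topic_word_   # per-topic word distributions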

I have also looked at sklearn.decomposition.LatentDirichletAllocation, but it uses variational Bayes instead of Gibbs sampling, and it doesn't look like it accepts burnin or thin anyway. Same goes for gensim (I think -- I am not very familiar with it).
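To illustrate the point, a hedged sketch of the scikit-learn interface (X is again a placeholder document-term matrix): its knobs control variational inference (learning_method, max_iter), not Gibbs burn-in or thinning:

from sklearn.decomposition import LatentDirichletAllocation

# X: document-term count matrix (placeholder)
lda_vb = LatentDirichletAllocation(
    n_components=60,          # number of topics
    max_iter=20,              # variational EM iterations, not Gibbs sweeps
    learning_method="batch",  # or "online"; there is no Gibbs option
    random_state=0,
)
doc_topic = lda_vb.fit_transform(X)   # per-document topic proportions
print(lda_vb.perplexity(X))           # perplexity on X, for model comparison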

Does this just not exist in Python? Or is there a workaround so that I can run a model in Python with Gibbs sampling and the parameters I want? Thanks!

Assuming you won't use online training, have you checked gensim's Latent Dirichlet Allocation via Mallet?

This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents, using an (optimized version of) collapsed Gibbs sampling from MALLET.
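A rough sketch of what that wrapper call might look like (not the answerer's code): it assumes gensim < 4.0, where gensim.models.wrappers.LdaMallet still ships, a local MALLET install at a placeholder path, and a placeholder list of tokenized documents. Note that iterations sets the total number of Gibbs sweeps and optimize_interval controls MALLET's hyperparameter optimization; there is no direct equivalent of topicmodels' burnin/thin arguments:

from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0

# tokenized_docs: list of token lists (placeholder)
mallet_path = "/path/to/mallet-2.0.8/bin/mallet"   # placeholder path to the MALLET binary
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

model = LdaMallet(
    mallet_path,
    corpus=corpus,
    id2word=dictionary,
    num_topics=60,          # one k value, for illustration
    iterations=4000,        # total Gibbs sweeps; no separate burnin/thin arguments
    optimize_interval=0,    # MALLET hyperparameter optimization (0 = off), not thinning
    workers=4,
)
print(model.show_topics(num_topics=5, num_words=10))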
