
Fast way to determine the optimal number of topics for a large corpus using LDA

I have a corpus consisting of around 160,000 documents. I want to do topic modeling on it using LDA in R (specifically the lda.collapsed.gibbs.sampler function in the lda package).

I want to determine the optimal number of topics. The common procedure seems to be to take a vector of candidate topic numbers, e.g. from 1 to 100, run the model once for each candidate, and then pick the one with the largest harmonic mean or smallest perplexity.
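For concreteness, a minimal sketch of that grid-search procedure with the lda package might look like the following. It assumes documents and vocab are already in the format the package expects (e.g. as produced by lexicalize()); the candidate grid, iteration count, burn-in, and the alpha = 50/K, eta = 0.1 priors are placeholder choices, not recommendations.

library(lda)

candidate_K <- seq(10, 100, by = 10)  # placeholder grid of topic numbers
burnin      <- 100                    # Gibbs iterations to discard before scoring

harmonic_mean <- function(log_lik) {
  # log of the harmonic mean of the per-iteration likelihoods,
  # computed with the log-sum-exp trick for numerical stability
  m <- max(-log_lik)
  log(length(log_lik)) - (m + log(sum(exp(-log_lik - m))))
}

fit_and_score <- function(K) {
  fit <- lda.collapsed.gibbs.sampler(documents, K, vocab,
                                     num.iterations = 500,
                                     alpha = 50 / K, eta = 0.1,
                                     compute.log.likelihood = TRUE)
  # row 2 of log.likelihoods holds the log-likelihood of the observations
  # given the topic assignments; drop the burn-in iterations
  ll <- fit$log.likelihoods[2, -(1:burnin)]
  harmonic_mean(ll)
}

scores <- sapply(candidate_K, fit_and_score)
best_K <- candidate_K[which.max(scores)]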

However, given the large number of documents, the optimal number of topics can easily reach several hundred or even thousands, and I find that the computation time grows significantly as the number of topics increases. Even with parallel computing, it would take several days or weeks.
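Since each candidate K is fit independently, the loop itself is what gets parallelised. A rough sketch with the parallel package, reusing the hypothetical fit_and_score and candidate_K from the sketch above (mclapply relies on forking, so on Windows a parLapply cluster would be needed instead):

library(parallel)

# fit each candidate K on a separate core and collect the scores
scores <- unlist(mclapply(candidate_K, fit_and_score,
                          mc.cores = max(1, detectCores() - 1)))
best_K <- candidate_K[which.max(scores)]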

Is there a better (more time-efficient) way to choose the optimal number of topics, or any suggestion for reducing the computation time?

Any suggestion is welcome.

Start with a guess somewhere in the middle, then decrease or increase the number of topics in steps of, say, 50 or 100 instead of 1. Check in which direction the coherence score is increasing and keep moving that way, shrinking the step as you go; I am confident the search will converge. A sketch of this coarse-to-fine search is shown below.
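In the sketch, score_K stands for any hypothetical function that fits the model at a given K and returns the metric being maximised (a coherence score, or equally a harmonic-mean or negative-perplexity score); the bounds and step sizes are arbitrary examples.

coarse_to_fine <- function(score_K, k_min = 100, k_max = 1000,
                           steps = c(100, 25, 5)) {
  lo <- k_min
  hi <- k_max
  best <- NA
  for (step in steps) {
    # evaluate the current window at the current resolution
    grid <- seq(lo, hi, by = step)
    scores <- sapply(grid, score_K)
    best <- grid[which.max(scores)]
    # narrow the window to one step either side of the current best
    lo <- max(k_min, best - step)
    hi <- min(k_max, best + step)
  }
  best
}

# e.g. best_K <- coarse_to_fine(fit_and_score)

Caching already-evaluated values of K would avoid refitting the grid points that overlap between passes.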
