
Fast way to determine the optimal number of topics for a large corpus using LDA

I have a corpus consisting of around 160,000 documents. I want to do topic modeling on it using LDA in R (specifically the lda.collapsed.gibbs.sampler function in the lda package).

I want to determine the optimal number of topics. The common procedure seems to be to take a vector of candidate topic numbers, e.g. from 1 to 100, run the model once for each candidate, and then pick the one with the largest harmonic mean or smallest perplexity.
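For concreteness, a minimal sketch of that grid-search procedure with the lda package might look like the following. It assumes documents and vocab are already in the format the package expects (e.g. as produced by lexicalize()); the candidate grid, iteration count, burn-in, and the alpha = 50/K, eta = 0.1 priors are placeholder choices, not recommendations.

library(lda)

candidate_K <- seq(10, 100, by = 10)  # placeholder grid of topic numbers
burnin      <- 100                    # Gibbs iterations to discard before scoring

harmonic_mean <- function(log_lik) {
  # log of the harmonic mean of the per-iteration likelihoods,
  # computed with the log-sum-exp trick for numerical stability
  m <- max(-log_lik)
  log(length(log_lik)) - (m + log(sum(exp(-log_lik - m))))
}

fit_and_score <- function(K) {
  fit <- lda.collapsed.gibbs.sampler(documents, K, vocab,
                                     num.iterations = 500,
                                     alpha = 50 / K, eta = 0.1,
                                     compute.log.likelihood = TRUE)
  # row 2 of log.likelihoods holds the log-likelihood of the observations
  # given the topic assignments; drop the burn-in iterations
  ll <- fit$log.likelihoods[2, -(1:burnin)]
  harmonic_mean(ll)
}

scores <- sapply(candidate_K, fit_and_score)
best_K <- candidate_K[which.max(scores)]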

However, given the large number of documents, the optimal number of topics can easily reach several hundred or even thousands, and I find that the computation time grows significantly as the number of topics increases. Even with parallel computing, it would take several days or weeks.
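Since each candidate K is fit independently, the loop itself is what gets parallelised. A rough sketch with the parallel package, reusing the hypothetical fit_and_score and candidate_K from the sketch above (mclapply relies on forking, so on Windows a parLapply cluster would be needed instead):

library(parallel)

# fit each candidate K on a separate core and collect the scores
scores <- unlist(mclapply(candidate_K, fit_and_score,
                          mc.cores = max(1, detectCores() - 1)))
best_K <- candidate_K[which.max(scores)]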

Is there a better (more time-efficient) way to choose the optimal number of topics, or any suggestion for reducing the computation time?

Any suggestion is welcome.

Start with a guess somewhere in the middle, then decrease or increase the number of topics in steps of, say, 50 or 100 instead of 1. Check in which direction the coherence score is increasing and keep moving that way, shrinking the step as you go; I am confident the search will converge. A sketch of this coarse-to-fine search is shown below.
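In the sketch, score_K stands for any hypothetical function that fits the model at a given K and returns the metric being maximised (a coherence score, or equally a harmonic-mean or negative-perplexity score); the bounds and step sizes are arbitrary examples.

coarse_to_fine <- function(score_K, k_min = 100, k_max = 1000,
                           steps = c(100, 25, 5)) {
  lo <- k_min
  hi <- k_max
  best <- NA
  for (step in steps) {
    # evaluate the current window at the current resolution
    grid <- seq(lo, hi, by = step)
    scores <- sapply(grid, score_K)
    best <- grid[which.max(scores)]
    # narrow the window to one step either side of the current best
    lo <- max(k_min, best - step)
    hi <- min(k_max, best + step)
  }
  best
}

# e.g. best_K <- coarse_to_fine(fit_and_score)

Caching already-evaluated values of K would avoid refitting the grid points that overlap between passes.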
