
How to enable multicore processing with sklearn LDA?

I have a topic model using sklearn LDA. My corpus has ~75K documents, and the matrix generated from the corpus has shape X.shape = (74645, 91542).

When I pass this matrix to sklearn LDA, it takes 3 hours on my local machine and 11 hours on the server.

So my question is:

Is there a way to use multicore processing in sklearn LDA? Or is there a way to reduce my processing time significantly?

Any help will be much appreciated.

Please take a look at the code:

The line that generates lda_output takes hours to run:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import datetime

vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2), vocabulary=word_list)
X = vectorizer.fit_transform(documents)

lda_model = LatentDirichletAllocation(n_components=50,            # Number of topics
                                      learning_decay=0.7,
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every=-1,         # compute perplexity every n iters, default: don't
                                      n_jobs=-1,                 # use all available CPUs
                                     )

# Redirect joblib's temp folder, because the system was running out of memory before this line

%env JOBLIB_TEMP_FOLDER=/tmp

start_time = datetime.datetime.now()

lda_output = lda_model.fit_transform(X)

end_time = datetime.datetime.now()

run_time_lda = end_time - start_time

#output:
#datetime.timedelta(0, 38157, 730304) ~ 11hrs
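For context, n_jobs=-1 is already the extent of scikit-learn's built-in multicore support for LDA: it uses joblib to parallelize the E-step across documents. A minimal sketch on a tiny synthetic corpus (documents and parameter values are made up for illustration, not the real data) confirming the call pattern and the shape of the result:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny synthetic corpus standing in for the real ~75K documents
docs = ["apple banana apple", "banana cherry", "cherry apple banana",
        "dog cat dog", "cat mouse", "mouse dog cat"] * 10

X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,
                                learning_method='online',
                                n_jobs=-1,        # parallel E-step via joblib
                                random_state=0)
doc_topic = lda.fit_transform(X)

# Each row is a normalized topic distribution for one document
print(doc_topic.shape)   # (60, 2)
```

Since only the E-step is parallelized, adding cores gives diminishing returns; shrinking the vocabulary (the 91,542 columns) usually helps far more than raising n_jobs.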

You might want to rethink your vocabulary word_list, which seems to be larger than your document count. Try building the vocabulary from the documents themselves, if that works for your problem.

Also, specify min_df to remove very low-frequency words. Lemmatization or stemming could also reduce the vocabulary size, and it would help the LDA learn better topics.

I would recommend not using bigrams/trigrams for LDA modelling, because they might lead to an uninterpretable model.
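Putting the suggestions above together, a minimal sketch (the corpus is synthetic and parameter values like min_df=2 and max_features=50000 are illustrative, not tuned): let CountVectorizer learn the vocabulary from the documents instead of passing a fixed word_list, drop rare terms with min_df, cap the size with max_features, and stick to unigrams. On the real corpus this should cut the 91,542 columns substantially, which in turn cuts LDA fitting time:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log",
        "cats and dogs", "a dog and a cat sat"] * 5

# No fixed word_list: the vocabulary is learned from the corpus itself
vectorizer = CountVectorizer(stop_words='english',
                             ngram_range=(1, 1),   # unigrams only
                             min_df=2,             # drop terms seen in < 2 docs
                             max_features=50000)   # hard cap on vocabulary size
X = vectorizer.fit_transform(docs)
print(X.shape, len(vectorizer.vocabulary_))
```

The resulting X can then be passed to LatentDirichletAllocation exactly as in the question, just with far fewer columns.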
