I have a topic model using sklearn's LDA. My corpus has ~75K documents, and the matrix generated from the corpus has shape X.shape = (74645, 91542).
When I pass this matrix to sklearn's LDA, it takes 3 hours on my local machine and 11 hours on the server.
So my question is: how can I reduce this runtime? Any help will be much appreciated.
Please take a look at the code; the line that generates lda_output is the one that takes hours to run:
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words='english',
                             ngram_range=(1, 2),
                             vocabulary=word_list)
X = vectorizer.fit_transform(documents)

lda_model = LatentDirichletAllocation(n_components=50,         # number of topics
                                      learning_decay=0.7,
                                      max_iter=10,              # max learning iterations
                                      learning_method='online',
                                      random_state=100,         # random state
                                      batch_size=128,           # n docs in each learning iter
                                      evaluate_every=-1,        # compute perplexity every n iters; default: don't
                                      n_jobs=-1)                # use all available CPUs

# Redirect joblib's temp folder -- before this line the system was running out of memory
%env JOBLIB_TEMP_FOLDER=/tmp

start_time = datetime.datetime.now()
lda_output = lda_model.fit_transform(X)
end_time = datetime.datetime.now()
run_time_lda = end_time - start_time
# output:
# datetime.timedelta(0, 38157, 730304) ~ 11 hrs
You might want to rethink your vocabulary word_list, which seems to be bigger than your document count. Try building the vocabulary from the documents themselves, if that works for your problem. Also specify min_df to remove very low-frequency words. Lemmatization/stemming may also help reduce the vocabulary size, and it would help LDA learn better topics.
I would also recommend not using bigrams/trigrams for LDA modelling, because they can lead to an uninterpretable model.
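To make that concrete, here is a minimal sketch of the suggested preprocessing; the min_df/max_df thresholds and the NLTK WordNetLemmatizer are illustrative assumptions, not taken from your setup:

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# nltk.download('wordnet')  # needed once for the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_doc(doc):
    # collapse inflected forms so the vocabulary stays small
    return ' '.join(lemmatizer.lemmatize(token) for token in doc.split())

docs_lemmatized = [lemmatize_doc(doc) for doc in documents]

# Let the vectorizer build the vocabulary from the corpus itself,
# unigrams only (the default): min_df=5 drops words appearing in fewer
# than 5 documents, max_df=0.5 drops words appearing in more than half.
# Both thresholds are illustrative -- tune them on your corpus.
vectorizer = CountVectorizer(stop_words='english',
                             min_df=5,
                             max_df=0.5)
X = vectorizer.fit_transform(docs_lemmatized)
print(X.shape)  # the second dimension should now be far below 91542

A smaller vocabulary directly shrinks the topic-word matrix that LDA updates on every pass, which is where most of the fitting time goes.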