Gensim Mallet Wrapper: How can I get all documents' topic weights?

I am using Gensim's Mallet wrapper for topic modeling:

lda_model = LdaMallet(path_to_mallet_binary, corpus=corpus, num_topics=100, id2word=words, workers=6, random_seed=2)

Training the model was surprisingly fast, but the step below, which obtains the topic distribution for each document (n=40,000), is taking a very long time.

from tqdm import tqdm

# Store the topic distribution for all documents, one lookup per document
all_topics = []
for x in tqdm(range(len(doc_list))):
    all_topics.append(lda_model[corpus[x]])

It has taken ~18 hours to complete 30,000 documents. I am not sure what I am doing incorrectly. Is there a much faster way to get the topic distribution for all documents?

I was able to speed this up by calling the Java Mallet binary directly through Python's subprocess module. The doc-topics distribution is written to a file that can easily be imported into a dataframe. The gensim wrapper is straightforward to use, but it seems to have issues.
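A minimal sketch of that approach, under stated assumptions: the file names and the helper names `build_mallet_command` and `read_doc_topics` are hypothetical, the input is assumed to have already been converted with Mallet's `import-file` step, and the doc-topics file is assumed to use the dense layout (document index, document name, then one weight per topic) with no whitespace inside document names. Older Mallet versions write (topic, weight) pairs instead, which this parser does not handle.

```python
def build_mallet_command(mallet_bin, input_file, num_topics, doc_topics_file):
    """Build a `mallet train-topics` command line (it is not executed here)."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,              # .mallet file created by import-file
        "--num-topics", str(num_topics),
        "--output-doc-topics", doc_topics_file,
    ]


def read_doc_topics(path, num_topics):
    """Parse a dense doc-topics file into one list of weights per document."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Skip blank lines and the optional '#doc name topic ...' header.
            if not line.strip() or line.startswith("#"):
                continue
            parts = line.split()
            # Assumed layout: doc_index, doc_name, weight_0 ... weight_{k-1}
            weights = [float(p) for p in parts[2:]]
            if len(weights) != num_topics:
                raise ValueError(
                    f"expected {num_topics} weights, got {len(weights)}"
                )
            rows.append(weights)
    return rows


# Possible usage (paths are placeholders):
# import subprocess
# cmd = build_mallet_command("bin/mallet", "corpus.mallet", 100, "doc-topics.txt")
# subprocess.run(cmd, check=True)
# all_topics = read_doc_topics("doc-topics.txt", 100)
```

The resulting list of rows can be passed to a dataframe constructor if a tabular view is needed.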

It turns out that most of the time was spent loading the LdaMallet model. I was able to generate 50,000 topic distributions in just 4 minutes by passing the whole corpus in a single call instead of one document at a time (done one by one, it took as long for me as it did for you).

corpus = [dictionary.doc2bow(preprocess(unseen_document)) for unseen_document in unseen_documents]
distributions = mallet_model[corpus]
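If the whole corpus is too large to pass in one call, a middle ground is to pass it in a few large chunks, so the per-call overhead is paid once per chunk rather than once per document. This is only a sketch and not part of gensim: `iter_chunks` and `infer_in_chunks` are hypothetical helpers, and it assumes that `model[chunk]` accepts a list of bag-of-words documents and returns one distribution per document, as the single-call example above suggests.

```python
def iter_chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def infer_in_chunks(model, corpus, chunk_size=10000):
    """Query `model` once per chunk instead of once per document.

    Assumes `model[chunk]` returns an iterable with one result per document.
    """
    results = []
    for chunk in iter_chunks(corpus, chunk_size):
        results.extend(model[chunk])
    return results


# Possible usage, reusing the names from the answer above:
# distributions = infer_in_chunks(mallet_model, corpus, chunk_size=10000)
```

With 50,000 documents and a chunk size of 10,000 this makes 5 calls to the model instead of 50,000.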

You could refer to https://github.com/RaRe-Technologies/gensim/issues/3018
