Gensim Mallet Wrapper: How to get topic weights for all documents?

Question

I am using Gensim's Mallet wrapper for topic modeling:

LdaMallet(path_to_mallet_binary, corpus=corpus, num_topics=100, id2word=words, workers=6, random_seed=2)

While the above ran surprisingly fast, the step below, which obtains the topic distribution for each document (n=40,000), is taking a very long time.

# Store the topic distribution for all documents
all_topics=[]
for x in tqdm(range(0, len(doc_list))):
    all_topics.append(lda_model[corpus[x]])

It has taken ~18 hours to complete 30,000 documents. I am not sure what I am doing incorrectly. Is there a way to get the topic distribution for all documents much faster?

Answer 1

I was able to speed this up by calling the Java Mallet directly through Python's subprocess. The doc-topics distribution is written to a file that can easily be imported into a dataframe. The gensim wrapper, although straightforward, seems to have issues.
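A minimal sketch of that approach, not the answerer's actual code. It assumes the corpus has already been converted to a .mallet file with Mallet's import step, and that the doc-topics file uses the dense layout (document index, document name, then one weight per topic, tab-separated); some older Mallet versions wrote topic/weight pairs instead, which this parser does not handle. The file names and helper names here are hypothetical.

```python
import subprocess


def build_train_command(mallet_bin, input_path, doc_topics_path,
                        num_topics=100, num_threads=6, random_seed=2):
    """Return the argument list for a `mallet train-topics` run."""
    return [
        mallet_bin, "train-topics",
        "--input", input_path,                    # .mallet file (assumed to exist)
        "--num-topics", str(num_topics),
        "--num-threads", str(num_threads),
        "--random-seed", str(random_seed),
        "--output-doc-topics", doc_topics_path,   # per-document topic weights
    ]


def parse_doc_topics(lines):
    """Parse dense doc-topics rows into one list of weights per document."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and an optional header line
        fields = line.split("\t")
        # fields[0] is the document index, fields[1] the document name
        rows.append([float(x) for x in fields[2:] if x != ""])
    return rows


def run_and_load(mallet_bin, input_path, doc_topics_path):
    """Train once, then read every document's topic weights from the output file."""
    cmd = build_train_command(mallet_bin, input_path, doc_topics_path)
    subprocess.run(cmd, check=True)
    with open(doc_topics_path, encoding="utf-8") as fh:
        return parse_doc_topics(fh)
```

The list of rows returned by parse_doc_topics can be passed straight to a dataframe constructor if a tabular view is wanted.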

Answer 2

It turns out most of the time was spent loading the LdaMallet model. I was able to generate 50,000 topic distributions in just 4 minutes when I passed the whole corpus in a single call instead of one document at a time (doing it one by one took me about as long as it took you).

corpus = [dictionary.doc2bow(preprocess(unseen_document)) for unseen_document in unseen_documents]
distributions = mallet_model[corpus]
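The batched call returns one entry per document, each a list of (topic_id, weight) pairs. As a rough sketch, assuming that shape and a known topic count, the result can be turned into fixed-width rows for further analysis. The helper name to_dense is hypothetical and not part of gensim.

```python
def to_dense(distributions, num_topics):
    """Convert per-document (topic_id, weight) pairs into dense weight rows."""
    dense = []
    for doc in distributions:
        row = [0.0] * num_topics          # topics missing from the pairs stay at 0.0
        for topic_id, weight in doc:
            row[topic_id] = weight
        dense.append(row)
    return dense


# Example with hand-written pairs standing in for mallet_model[corpus]
example = [[(0, 0.25), (2, 0.75)], [(1, 1.0)]]
matrix = to_dense(example, num_topics=3)
```

Each row of matrix then lines up with the document at the same position in the corpus.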

You could refer to https://github.com/RaRe-Technologies/gensim/issues/3018

Answer 1: 2020-06-18 07:20:48
Answer 2: 2021-01-03 16:02:43
