
Gensim LdaMulticore is not multiprocessing properly (using just 4 workers)

I am using Gensim's LdaMulticore to perform LDA. I have around 28M small documents (around 100 characters each).

I have set the workers argument to 20, but top shows only 4 processes being used. There are some discussions suggesting that the bottleneck might be slow corpus reading, e.g.: gensim LdaMulticore not multiprocessing? https://github.com/piskvorky/gensim/issues/288

But both of those use MmCorpus, whereas my corpus is entirely in memory. My machine has a very large amount of RAM (250 GB), and loading the corpus into memory takes around 40 GB. Even so, LdaMulticore uses just 4 processes. I created the corpus as:

corpus = [dictionary.doc2bow(text) for text in texts]

I am not able to understand what the limiting factor here could be.
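For reference, here is a minimal sketch of the setup being described; the toy texts list and the num_topics, passes and chunksize values are placeholders, not taken from the question:

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Placeholder tokenized documents; the real corpus has ~28M short docs.
texts = [["human", "interface", "computer"],
         ["graph", "trees", "system"],
         ["system", "human", "graph"]]

# Build the in-memory bag-of-words corpus exactly as in the question.
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train with 20 worker processes; num_topics, passes and chunksize
# are illustrative values only.
lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    workers=20,
    chunksize=2000,   # gensim's default
    passes=1,
)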

I would check what batch size you are using.

I found that in cases where batch_size × n_workers is greater than the number of documents, I cannot utilize all the available workers. This makes sense, as each worker is given a batch of documents per pass; you might "starve" some of them if the batch value is not taken into account.

I am not sure this solves your specific problem, but it is indeed the reason many people have mentioned that the multicore version does not "work" as expected in terms of multiprocessing.
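To illustrate that rule of thumb: with corpus and dictionary as built in the question, you want chunksize * workers to be at most the number of documents, so that every worker receives at least one chunk per pass. The pick_chunksize helper below is a hypothetical sketch, not part of gensim:

from gensim.models import LdaMulticore

def pick_chunksize(n_docs, workers, default=2000):
    # Hypothetical helper: shrink the chunk size whenever
    # default * workers would exceed the number of documents,
    # so no worker is left without a chunk per pass.
    return min(default, max(1, n_docs // workers))

workers = 20
chunksize = pick_chunksize(len(corpus), workers)  # for ~28M docs this stays at 2000

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,    # placeholder
    workers=workers,
    chunksize=chunksize,
)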
