Topic Modeling Memory Error: How to do gensim topic modelling with large amounts of data

I'm having an issue topic modeling with a lot of data. I am trying to do both LDA and NMF topic modeling, which I have done before, but not with the volume of data I am currently working with. The main issue is that I can't hold all my data in memory while also creating the models.

I need both the models and the associated metrics. Here is the code for how I currently make my models:

from gensim.models import LdaMulticore
from gensim.models.nmf import Nmf


def make_lda(dictionary, corpus, num_topics):
    passes = 3
    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token
    model = LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )
    return model


def make_nmf(dictionary, corpus, num_topics):
    passes = 3
    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token
    model = Nmf(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )
    return model

And here is how I get the coherence measures and some other statistics:

import numpy as np


def get_model_stats(model, model_type, docs, dictionary, corpus, num_topics, verbose=False, get_topics=False):
    if model_type == 'lda':
        top_topics = model.top_topics(texts=docs, dictionary=dictionary, coherence='c_v')  #, num_words=20)
    elif model_type == 'nmf':
        top_topics = model.top_topics(corpus=corpus, texts=docs, dictionary=dictionary, coherence='c_v')  #, num_words=20)
    else:
        # Without this branch an unknown model_type would leave top_topics undefined.
        raise ValueError("model_type must be 'lda' or 'nmf'")
    # Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
    avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
    # Relative standard deviation of the per-topic coherences.
    rstd_atc = np.std([t[1] for t in top_topics]) / avg_topic_coherence
    if verbose:
        print('Average topic coherence: ', avg_topic_coherence)
        print('Relative Standard Deviation of ATC: ', rstd_atc)
    if get_topics:
        return avg_topic_coherence, rstd_atc, top_topics
    return avg_topic_coherence, rstd_atc

As you can see, I need my dictionary, texts, corpus, and id2token objects in memory at different times, sometimes all at once. But I can't do that, since something like my texts alone uses up a ton of memory. My machine just does not have enough.

I know I can pay for a virtual machine with crazy amounts of RAM, but I want to know if there is a better solution. I can store all of my data on disk. Is there a way to run these models where the data is not in memory? Is there some other solution where I don't overload my memory?

There are some small tweaks you can use that will likely not make much difference (e.g. changing list comprehensions into generator expressions, for instance when summing up), but this is a general memory-saving hint, so I thought it was worth mentioning.
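
As a small illustration of the generator hint, here is a sketch that assumes `top_topics` is a list of `(topic, coherence)` pairs like the one used in the question. The values below are made up for the example.

```python
# Hypothetical stand-in for the (topic, coherence) pairs returned by top_topics().
top_topics = [("topic-a", 0.25), ("topic-b", 0.5), ("topic-c", 0.75)]

# List comprehension: materialises a temporary list before summing.
total_from_list = sum([score for _, score in top_topics])

# Generator expression: feeds sum() one value at a time, no temporary list.
total_from_generator = sum(score for _, score in top_topics)

avg_topic_coherence = total_from_generator / len(top_topics)
print(avg_topic_coherence)
```

With only a handful of topics the saving is negligible, which is why this tweak is unlikely to matter much on its own.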

A more notable difference can come from more aggressive pruning of the Dictionary. The default is prune_at=2000000. You may want to lower that threshold if you have plenty of documents.

Another thing to do is to apply the filter_extremes function to the created dictionary to remove words that are unlikely to influence the results. Here again you can set the parameters more aggressively:
no_below – Keep tokens which are contained in at least no_below documents.
no_above – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).
keep_n – Keep only the first keep_n most frequent tokens.

On top of that, you may want to call the garbage collector every once in a while (e.g. before running the make_nmf function):

import gc
gc.collect()

And definitely do not run make_nmf and make_lda in parallel (you are probably not doing that, but I wanted to highlight it because we do not see your whole code).

Tweaking these values can help you reduce the memory footprint while keeping the model as good as possible.

You don't show how your corpus (or docs / texts) is created, but the single most important thing to remember with Gensim is that entire training sets essentially never have to be in memory at once (as with a giant list).

Rather, you can (and, for any large corpus where memory is a possible issue, should) provide it as a re-iterable Python sequence that only reads individual items from underlying storage as requested. Using a Python generator is usually a key part (but not the whole story) of such an approach.
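
A minimal sketch of such a re-iterable sequence, assuming one document per line in a plain-text file and whitespace tokenisation. The class names `StreamedTexts` and `StreamedBowCorpus` are hypothetical, not part of Gensim; the only Gensim call relied on is the dictionary's `doc2bow` method.

```python
import os
import tempfile


class StreamedTexts:
    """Re-iterable stream of token lists, one document per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # A fresh file handle is opened on every pass, so the object can be
        # iterated repeatedly; a bare generator would be exhausted after one.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()  # assumed tokenisation
                if tokens:
                    yield tokens


class StreamedBowCorpus:
    """Re-iterable bag-of-words view over a stream of token lists."""

    def __init__(self, texts, dictionary):
        self.texts = texts
        self.dictionary = dictionary  # expected to offer doc2bow(tokens)

    def __iter__(self):
        for tokens in self.texts:
            yield self.dictionary.doc2bow(tokens)


# Demo on a throwaway file: two passes read the same documents from disk.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Topic models need data\n\nStreaming keeps memory low\n")
    demo_path = tmp.name

texts = StreamedTexts(demo_path)
first_pass = list(texts)
second_pass = list(texts)
os.remove(demo_path)

print(first_pass == second_pass)
```

Only one line is held in memory at a time, at the cost of re-reading the file on every pass. Objects like these should be usable wherever the question's code passes `corpus` or `docs`, though that is worth checking against the Gensim version you have installed.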

The original creator of the Gensim package has a blog post going over the basics: "Data streaming in Python: generators, iterators, iterables".