Topic modeling memory error: how to do gensim topic modeling with large amounts of data

Question

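I'm having a problem doing topic modeling with a large amount of data. I'm trying to do LDA and NMF topic modeling, which I have done before, but not with the volume of data I'm working with now. The main problem is that I can't hold all of my data in memory while also creating the models.

I need both the models and their associated metrics. Here is the code I currently use to build the models: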

from gensim.models import LdaMulticore

def make_lda(dictionary, corpus, num_topics):
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary (id2token is built lazily).
    id2word = dictionary.id2token

    model = LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )

    return model

from gensim.models.nmf import Nmf

def make_nmf(dictionary, corpus, num_topics):
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary (id2token is built lazily).
    id2word = dictionary.id2token

    model = Nmf(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )

    return model

Here is how I get the coherence measure and some other statistics:

import numpy as np

def get_model_stats(model, model_type, docs, dictionary, corpus, num_topics, verbose=False, get_topics=False):
    if model_type == 'lda':
        top_topics = model.top_topics(texts=docs, dictionary=dictionary, coherence='c_v') #, num_words=20)
    elif model_type == 'nmf':
        top_topics = model.top_topics(corpus=corpus, texts=docs, dictionary=dictionary, coherence='c_v') #, num_words=20)
    else:
        # Without this branch, an unknown type fails later with an unbound top_topics.
        raise ValueError("model_type must be 'lda' or 'nmf'")

    # Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
    avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
    rstd_atc = np.std([t[1] for t in top_topics]) / avg_topic_coherence

    if verbose:
        print('Average topic coherence: ', avg_topic_coherence)
        print('Relative Standard Deviation of ATC: ', rstd_atc)

    if get_topics:
        return avg_topic_coherence, rstd_atc, top_topics

    return avg_topic_coherence, rstd_atc

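As you can see, I need my dictionary, texts, corpus, and id2token objects in memory at different times, sometimes simultaneously. But I can't do that, because things like my texts use a great deal of memory. My machine just doesn't have enough.

I know I could pay for a virtual machine with a lot of RAM, but I'd like to know whether there is a better solution. I can store all of my data on disk. Is there a way to run these models when the data is not in memory? Is there any other solution that keeps me from overloading my memory?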

Answer 1

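There are some small tweaks you can use that probably won't make much of a difference (e.g. changing list comprehensions into generators, for instance when summing), but this is a general memory-saving tip, so I thought it was worth mentioning.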
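
As a rough illustration of that tip, the sketch below compares summing over a list comprehension with summing over a generator expression. The data is made up for the example, and the size reported by `sys.getsizeof` covers only the container object itself, not the integers it refers to.

```python
import sys

# Made-up stand-in for per-topic scores; any iterable of numbers would do.
values = range(100_000)

# List comprehension: builds the full intermediate list before summing.
total_from_list = sum([v * 2 for v in values])

# Generator expression: feeds one item at a time to sum().
total_from_gen = sum(v * 2 for v in values)

# Compare the size of the intermediate containers themselves.
list_size = sys.getsizeof([v * 2 for v in values])
gen_size = sys.getsizeof(v * 2 for v in values)

print("same result:", total_from_list == total_from_gen)
print("list bytes:", list_size, "generator bytes:", gen_size)
```

For a handful of topic coherences the saving is negligible, which matches the caveat above; the habit matters more when the intermediate list would be as long as the corpus.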

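A more significant difference can come from pruning the Dictionary more aggressively. The default parameter is prune_at=2000000. If you have a large number of documents, you may want to lower that threshold.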

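Another thing is to apply the filter_extremes function to the created dictionary to remove words that are unlikely to influence the results. Here again you can set the parameters more aggressively:

no_below: keep tokens that are contained in at least no_below documents.

no_above: keep tokens that are contained in no more than no_above documents (a fraction of the total corpus size, not an absolute number).

keep_n: keep only the first keep_n most frequent tokens.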

On top of that, you may want to call the garbage collector every once in a while (e.g. before running the make_nmf function):

import gc
gc.collect()

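And definitely don't run make_nmf and make_lda in parallel (you are probably not doing that, but I wanted to highlight it because we can't see your whole code).

Tweaking these values can help you reduce the required memory footprint while keeping the best possible model.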

Answer 2

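You don't show how your corpus (or docs / texts) is created, but the single most important thing to remember with Gensim is that an entire training set essentially never has to be in memory at once (as with a giant list).

Rather, you can (and for any large corpus where memory is a possible issue, should) provide it as a re-iterable Python sequence that only reads individual items from the underlying storage as they are requested. Using a Python generator is usually a key part (but not the whole story) of such an approach.

The original creator of the Gensim package has a blog post going over the basics: "Data streaming in Python: generators, iterators, iterables"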
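
To make the idea of a re-iterable sequence concrete, here is a minimal standard-library sketch. The class name `StreamedTexts`, the one-document-per-line file layout, and the whitespace tokenization are all assumptions made for this illustration; they are not part of Gensim.

```python
import os
import tempfile


class StreamedTexts:
    """Hypothetical re-iterable view over a text file, one document per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Opening the file inside __iter__ starts a fresh pass on every
        # iteration, which multi-pass training needs. A bare generator
        # would be exhausted after the first pass.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()  # naive tokenizer (assumption)
                if tokens:
                    yield tokens


# Write a tiny stand-in corpus to disk for the demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Topic models need data\n\nStreaming keeps memory low\n")
    corpus_path = tmp.name

texts = StreamedTexts(corpus_path)
first_pass = list(texts)   # list() is used here only to inspect the toy data
second_pass = list(texts)  # a second pass works because __iter__ reopens the file
os.remove(corpus_path)

print(first_pass == second_pass, len(first_pass))
```

An object like this can be handed to `Dictionary(...)` directly, and a similar class whose `__iter__` yields `dictionary.doc2bow(tokens)` could serve as the `corpus` argument, so only one document needs to be in memory at a time.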
