Topic modeling memory error: how to do gensim topic modeling with a large amount of data

Question

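I'm running into a problem doing topic modeling with a large amount of data. I'm trying to do LDA and NMF topic modeling, which I have done before, but never with the volume of data I'm working with now. The main problem is that I can't hold all of my data in memory while also creating the models.

I need both the models and the associated metrics. Here is the code showing how I currently build my models: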

from gensim.models import LdaMulticore

def make_lda(dictionary, corpus, num_topics):
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token

    model = LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )
    
    return model

from gensim.models.nmf import Nmf

def make_nmf(dictionary, corpus, num_topics):
    
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token
    
    model = Nmf(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )
    
    return model

Here is how I get the coherence measures and some other statistics:

import numpy as np

def get_model_stats(model, model_type, docs, dictionary, corpus, num_topics, verbose=False, get_topics=False):
    if model_type == 'lda':
        top_topics = model.top_topics(texts=docs, dictionary=dictionary, coherence='c_v') #, num_words=20)
    elif model_type == 'nmf':
        top_topics = model.top_topics(corpus=corpus, texts=docs, dictionary=dictionary, coherence='c_v') #, num_words=20)

    # Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
    avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
    rstd_atc = np.std([t[1] for t in top_topics]) / avg_topic_coherence
  
    if verbose:
        print('Average topic coherence: ', avg_topic_coherence)
        print('Relative Standard Deviation of ATC: ', rstd_atc)
    
    if get_topics:
        return avg_topic_coherence, rstd_atc, top_topics
    
    return avg_topic_coherence, rstd_atc

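As you can see, I need my dictionary, texts, corpus, and id2token objects in memory at different times, and sometimes all at once. I can't do that, because things like my texts use up a huge amount of memory. My machine just doesn't have enough.

I know I could pay for a virtual machine with a large amount of RAM, but I'd like to know whether there is a better solution. I can store all of my data on disk. Is there a way to run these models when the data is not in memory? Is there some other solution that keeps me from overloading my memory?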

Answer 1

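There are some small tweaks you can make that probably won't change much (for example, turning list comprehensions into generators, such as when summing), but this is a general memory-saving tip, so I think it is worth mentioning.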
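As a minimal sketch of that tweak: the list of `(topic, coherence)` pairs below is a made-up stand-in for what `model.top_topics(...)` returns in the question, not real model output. Passing a generator expression to `sum` avoids building the temporary list that the list comprehension creates.

```python
import random

random.seed(0)
# Hypothetical stand-in for model.top_topics(...): (topic, coherence) pairs.
top_topics = [(f"topic_{i}", random.random()) for i in range(1000)]

# List comprehension: builds a full temporary list before summing.
total_from_list = sum([t[1] for t in top_topics])

# Generator expression: feeds values to sum() one at a time, no temporary list.
total_from_generator = sum(t[1] for t in top_topics)

avg_topic_coherence = total_from_generator / len(top_topics)
```

The saving here is only the size of the temporary list, which is why this counts as a minor tweak compared with shrinking the dictionary or streaming the corpus.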

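Where you can make a notable difference is by pruning the Dictionary more aggressively. The default parameter is prune_at=2000000. If you have a lot of documents, you may want to lower that threshold.

Another thing is to apply the filter_extremes function to the created dictionary to remove words that are unlikely to influence the results. Here again you can set the parameters more aggressively: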

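no_below – Keep tokens that are contained in at least no_below documents.

no_above – Keep tokens that are contained in no more than no_above documents (a fraction of the total corpus size, not an absolute number).

keep_n – Keep only the first keep_n most frequent tokens.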

On top of that, you may want to call the garbage collector every once in a while (for example, before running the make_nmf function):

import gc
gc.collect()

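And definitely do not run make_nmf and make_lda in parallel (you probably aren't doing that, but I want to stress it because we can't see your whole code).

Tweaking these values can help you reduce the required memory footprint while keeping the best possible model.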

Answer 2

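You don't show how your corpus (or docs/texts) is created, but the single most important thing to remember with Gensim is that an entire training set essentially never has to be in memory at once (as a giant list).

Rather, you can (and for any large corpus where memory is a possible issue, should) provide it as a re-iterable Python sequence that only reads individual items from underlying storage as they are requested. Using a Python generator is usually a key part of such an approach (but not the whole story).

The original creator of the Gensim package has a blog post going over the basics: "Data streaming in Python: generators, iterators, iterables"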
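To illustrate what "re-iterable" means here, the following is a minimal sketch that assumes one whitespace-tokenized document per line in a text file. The class name `StreamedDocs`, the helper `read_once`, and the file layout are hypothetical choices for this example. It also shows why a bare generator is not the whole story: it can only be consumed once.

```python
import os
import tempfile


class StreamedDocs:
    """Re-iterable stream of tokenized documents, one document per line on disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass, so each iteration starts from the top
        # and only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def read_once(path):
    """Plain generator function: streams too, but is exhausted after one pass."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


# Write a tiny made-up corpus to disk purely for demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("topic model memory\ntopic model corpus\n")
    corpus_path = tmp.name

docs = StreamedDocs(corpus_path)
first_pass = list(docs)
second_pass = list(docs)  # works again: __iter__ restarts the stream

one_shot = read_once(corpus_path)
gen_first = list(one_shot)
gen_second = list(one_shot)  # empty: the generator is already consumed

os.remove(corpus_path)
```

A class along these lines whose `__iter__` yields `dictionary.doc2bow(tokens)` instead of raw token lists could plausibly be passed as the `corpus` argument in the question's code; training with `passes=3` iterates over the corpus more than once, which is why the re-iterable form matters.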
