
Topic Modeling Memory Error: How to do gensim topic modelling with large amounts of data

I'm having an issue topic modeling with a lot of data. I am trying to do both LDA and NMF topic modeling, which I have done before, but not with the volume of data I am currently working with. The main issue is that I can't hold all my data in memory while also creating the models.

I need both the models and associated metrics. Here is the code for how I currently build my models:

from gensim.models import LdaMulticore


def make_lda(dictionary, corpus, num_topics):
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token

    model = LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )

    return model

from gensim.models.nmf import Nmf


def make_nmf(dictionary, corpus, num_topics):
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token

    model = Nmf(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )

    return model

And here is how I get the coherence measures and some other statistics:

import numpy as np


def get_model_stats(model, model_type, docs, dictionary, corpus, num_topics, verbose=False, get_topics=False):
    if model_type == 'lda':
        top_topics = model.top_topics(texts=docs, dictionary=dictionary, coherence='c_v')  # , num_words=20)
    elif model_type == 'nmf':
        top_topics = model.top_topics(corpus=corpus, texts=docs, dictionary=dictionary, coherence='c_v')  # , num_words=20)

    # Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
    avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
    rstd_atc = np.std([t[1] for t in top_topics]) / avg_topic_coherence

    if verbose:
        print('Average topic coherence: ', avg_topic_coherence)
        print('Relative Standard Deviation of ATC: ', rstd_atc)

    if get_topics:
        return avg_topic_coherence, rstd_atc, top_topics

    return avg_topic_coherence, rstd_atc

As you can see, I need my dictionary, texts, corpus, and id2token objects in memory at different times, sometimes all at once. But I can't do that, since things like my texts use up a ton of memory. My machine simply does not have enough.

I know I could pay for a virtual machine with a huge amount of RAM, but I want to know if there is a better solution. I can store all of my data on disk. Is there a way to run these models where the data is not in memory? Is there some other solution where I don't overload my memory?

There are some small tweaks you can use that will likely not make much difference (e.g. changing list comprehensions into generators, e.g. when summing up), but this is a general memory-saving hint, so I thought it worth mentioning.
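Applied to the coherence averaging in your get_model_stats, that hint would look like the sketch below (the top_topics values here are made-up placeholders just to make it runnable):

# Hypothetical stand-in for the (topic, coherence) pairs returned by top_topics().
top_topics = [([(0.1, "word_a")], 0.42), ([(0.2, "word_b")], 0.38)]
num_topics = len(top_topics)

# List comprehension: materialises an intermediate list of coherences in memory.
avg_list = sum([t[1] for t in top_topics]) / num_topics

# Generator expression: sums the values lazily, with no intermediate list.
avg_gen = sum(t[1] for t in top_topics) / num_topics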

A change that can make a notable difference is to use more aggressive pruning on the Dictionary. The default is prune_at=2000000; you may want to lower this threshold if you have plenty of documents.

Another thing to do is to apply the filter_extremes function to the created dictionary to remove words that are unlikely to influence the results. Here again you can set the parameters more aggressively (see the sketch after the parameter list below):

no_below – Keep tokens which are contained in at least no_below documents.

no_above – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).

keep_n – Keep only the first keep_n most frequent tokens.
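A minimal sketch of both ideas, assuming docs is an iterable of tokenised documents; the threshold values below are placeholders you would tune for your corpus:

from gensim.corpora import Dictionary

# docs: an iterable of tokenised documents, e.g. [["human", "interface"], ...]
docs = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

# Prune more aggressively while building the dictionary (default prune_at is 2000000).
dictionary = Dictionary(docs, prune_at=100000)

# Drop rare and overly common tokens, and cap the vocabulary size.
dictionary.filter_extremes(no_below=20, no_above=0.5, keep_n=50000)

# Bag-of-words corpus built from the pruned dictionary.
corpus = [dictionary.doc2bow(doc) for doc in docs]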

On top of that, you may want to call the garbage collector every once in a while (e.g. before running the make_nmf function):

import gc
gc.collect()

And definitely do not run make_nmf and make_lda in parallel (you are probably not doing that, but I wanted to highlight it because we do not see your whole code).

Tweaking these values can help you reduce the memory footprint while maintaining the best possible model.

You don't show how your corpus (or docs / texts) is created, but the single most important thing to remember with Gensim is that entire training sets essentially never have to be in memory at once (as with a giant list).

Rather, you can (and, for any large corpus where memory is a possible issue, should) provide it as a re-iterable Python sequence that only reads individual items from underlying storage as requested. Using a Python generator is usually a key part (but not the whole story) of such an approach.

The original creator of the Gensim package has a blog post going over the basics: "Data streaming in Python: generators, iterators, iterables".
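A minimal sketch of such a streaming corpus, assuming one pre-tokenised document per line in a file called corpus.txt (the file name and whitespace tokenisation are placeholder assumptions):

from gensim.corpora import Dictionary

class StreamingCorpus:
    """Re-iterable corpus that reads one document per line from disk."""

    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        # A fresh file handle on every pass, so the corpus can be iterated
        # repeatedly (LDA/NMF training makes several passes) without ever
        # holding all documents in RAM.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.dictionary.doc2bow(line.split())

# Build the dictionary with one streaming pass over the same file.
with open("corpus.txt", encoding="utf-8") as f:
    dictionary = Dictionary(line.split() for line in f)

corpus = StreamingCorpus("corpus.txt", dictionary)

# corpus can now be passed to make_lda / make_nmf in place of an in-memory list.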
