
MemoryError: unable to allocate array with shape and data type float32 while using word2vec in python

I am trying to train a word2vec model on Wikipedia text data, using the following code.

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments

    if len(sys.argv) < 3:
        print (globals()['__doc__'])
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)

    model.save(outp)

But after the program has been running for about 20 minutes, I get the following error:

[screenshot of the error message]

Ideally, you should paste the text of your error into your question, rather than a screenshot. However, I see the two key lines:

<TIMESTAMP> : INFO : estimated required memory for 2372206 words and 400 dimensions: 8777162200 bytes
...
MemoryError: unable to allocate array with shape (2372206, 400) and data type float32

After making one pass over your corpus, the model has learned how many unique words will survive, and from that it reports how large a model must be allocated: one taking about 8777162200 bytes (about 8.8 GB). But when it tries to allocate the required vector array, you get a MemoryError, which indicates that not enough addressable memory (RAM) is available.
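To see roughly where that estimate comes from (this is my own back-of-the-envelope arithmetic, not a quote from gensim's source), count two float32 arrays of shape (words, dimensions) – the input word-vectors and the output-layer weights used for negative sampling – plus a few hundred bytes of per-word vocabulary overhead:

vocab_size, dims = 2372206, 400

vectors = vocab_size * dims * 4        # input word-vectors, float32: ~3.8 GB
syn1neg = vocab_size * dims * 4        # output-layer weights: another ~3.8 GB
vocab_overhead = vocab_size * 500      # assumed per-word dictionary overhead: ~1.2 GB

print(vectors + syn1neg + vocab_overhead)   # 8777162200 bytes, matching the log line

The MemoryError itself is raised while allocating just one of those (2372206, 400) float32 arrays, which alone needs about 3.8 GB of contiguous memory.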

You can either:

  1. run where there's more memory, perhaps by adding RAM to your existing system; or
  2. reduce the amount of memory required, chiefly by reducing either the number of unique word-vectors you'd like to train, or their dimensional size.

You could reduce the number of words by increasing the default min_count=5 parameter to something like min_count=10, min_count=20, or min_count=50. (You probably don't need over 2 million word-vectors – many interesting results are possible with a vocabulary of just a few tens-of-thousands of words.)

You could also set a max_final_vocab value, to specify an exact number of unique words to keep. For example, max_final_vocab=500000 would keep just the 500000 most-frequent words, ignoring the rest.

Reducing the vector size will also save memory. A setting of size=300 is popular for word-vectors, and compared to your size=400 it would reduce the memory requirements by a quarter.

Together, using size=300, max_final_vocab=500000 should trim the required memory to under 2GB.
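A minimal sketch of what the adjusted training call could look like, assuming the same pre-4.0 gensim parameter names as in the question (in gensim 4.x, size was renamed vector_size), with 'wiki.txt' standing in for your one-sentence-per-line corpus file:

import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(
    LineSentence('wiki.txt'),        # placeholder path to the prepared corpus
    size=300,                        # 300 dimensions instead of 400
    max_final_vocab=500000,          # keep only the 500000 most-frequent words
    min_count=5,                     # very rare words are still discarded
    workers=multiprocessing.cpu_count(),
)
model.save('wiki.model')             # placeholder output path

By the same arithmetic as above, 500000 words at 300 dimensions needs roughly 500000 * (300 * 4 * 2 + 500) ≈ 1.45 GB.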

I encountered the same problem while working on a pandas DataFrame; I solved it by converting float64 columns to uint8. (Of course, for columns that don't strictly need float64, you can also try float32 instead.)

import numpy as np
data['label'] = data['label'].astype(np.uint8)

If you encounter conversion errors:

data['label'] = data['label'].astype(np.uint8, errors='ignore')
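As a rough illustration with made-up data (not the original poster's DataFrame), you can compare the memory used before and after the conversion:

import numpy as np
import pandas as pd

# hypothetical data: small integer labels stored as float64 by default
data = pd.DataFrame({'label': np.random.randint(0, 10, size=1000000).astype(np.float64)})

print(data['label'].memory_usage(deep=True))   # about 8 MB as float64
data['label'] = data['label'].astype(np.uint8)
print(data['label'].memory_usage(deep=True))   # about 1 MB as uint8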

I don't know if it works in this case, but you can increase the amount of virtual memory in your system by using space on an SSD. It worked for me in other projects when the RAM needed to run the algorithms was too high.

- Go to the Start Menu and click on Settings.
- Type performance.
- Choose Adjust the appearance and performance of Windows.
- In the new window, go to the Advanced tab and, under the Virtual memory section, click on Change.
- At the bottom of the new window, check what the Recommended value is and how it compares to Currently allocated. You can go above the recommended value.

After trying many fixes (such as adjusting virtual memory and reinstalling Python), what worked for me was changing the numpy dtype from the default float64 to float32.
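For example (a generic sketch, not necessarily the exact change from the original project), the dtype can be set when the array is created, or an existing float64 array can be downcast:

import numpy as np

a = np.zeros((100000, 400), dtype=np.float32)   # ~160 MB instead of ~320 MB as float64

b = np.random.rand(100000, 400)                 # float64 by default
b = b.astype(np.float32)                        # halves the memory footprint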
