MemoryError using Python and Doc2Vec

I'm trying to train a Doc2Vec model on massive data. I have 20k files totaling 72GB, and wrote this code:

from os import listdir
from os.path import isfile, join
import random

from nltk.tokenize import word_tokenize   # assuming NLTK's tokenizer
from gensim.models.doc2vec import Doc2Vec

def train():
    onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    data = []
    random.shuffle(onlyfiles)
    tagged_data = []
    t = 0
    try:
        for file_name in onlyfiles:
            with open(mypath+"/"+file_name, 'r', encoding="utf-8") as file:
                txt = file.read()
                tagged_data.append([word_tokenize(txt.lower()), [str(t)]])
                t+=1
    except Exception as e:
        print(t)
        return 
    print("Files Loaded")
    max_epochs = 1000
    vec_size = 500
    alpha = 0.025

    model = Doc2Vec(vector_size=vec_size,
                    alpha=alpha, workers=1,
                    min_alpha=0.00025,
                    min_count=1,
                    dm=1)

    print("Model Works")
    print("Building vocabulary")

    model.build_vocab(tagged_data)
    print("Trainning")
    for epoch in range(max_epochs):
        print("Iteration {0}".format(epoch))
        model.train(tagged_data,
                    total_examples=model.corpus_count,
                    epochs=model.iter)
        model.alpha -= 0.0002
        model.min_alpha = model.alpha

    model.save(model_name)
    print("Model Saved")

But when I run this method, this error appears:

Traceback (most recent call last):

File "doc2vec.py", line 20, in train
    tagged_data.append([word_tokenize(txt.lower()), [str(t)]])
MemoryError

Only about 3k files are processed before the error. But when I check memory usage, the Python process shows only 1.7% of memory in use. Is there any parameter I can pass to Python to solve this? How can I fix it?

You're getting the error long before even trying Doc2Vec, so this isn't really a Doc2Vec question - it's a problem with your Python data handling. Do you have enough RAM to load 72GB of disk-data (which might expand somewhat when represented as Python string objects) into RAM?

But also, you won't usually have to bring an entire corpus into memory, by appending to a giant list, to do any of these tasks. Read things one at a time, and process from an iterable/iterator, perhaps writing interim results (like tokenized text) back to the IO sources. This article may be helpful:

https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
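
For example, a minimal streaming sketch (assuming the same mypath directory and NLTK's word_tokenize used in the question) wraps the file-reading in a restartable iterable that yields one TaggedDocument at a time, instead of building a giant list:

from os import listdir
from os.path import isfile, join

from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import TaggedDocument

class CorpusStream:
    """Re-reads the files on every pass, yielding one TaggedDocument at a time."""
    def __init__(self, dirpath):
        self.dirpath = dirpath
        self.filenames = [f for f in listdir(dirpath) if isfile(join(dirpath, f))]

    def __iter__(self):
        for t, file_name in enumerate(self.filenames):
            with open(join(self.dirpath, file_name), 'r', encoding="utf-8") as f:
                txt = f.read()
            yield TaggedDocument(words=word_tokenize(txt.lower()), tags=[str(t)])

Because this is a restartable iterable (not a one-shot generator), both build_vocab() and train() can make their own full passes over it, and only one document's tokens need to be in memory at any moment.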

Finally, if your code did proceed to the Doc2Vec section, you'd have other problems. Whatever online example you're consulting as a model has many bad practices. For example:

  • a typical iteration-count is 10-20; you certainly wouldn't use 1000 for a 72GB dataset

  • min_count=1 leads to a much bigger model; usually discarding low-frequency words is necessary and may even improve resulting vector quality, and larger datasets (and 72GB is very, very big) tend to use larger rather than minimal min_count settings

  • most people shouldn't be using non-default alpha / min_alpha values, or trying to manage them with their own calculations, or even calling train() more than once. train() has its own epochs parameter which if used will smoothly handle the learning-rate alpha for you. As far as I can tell, 100% of the people who call train() multiple times in their own loop are doing it wrong, and I have no idea where they keep getting these examples.

  • Training goes much slower with workers=1; especially with a large dataset you'll want to try larger workers values, and the optimal value for training throughput in gensim versions up through 3.5.0 is usually somewhere in the range from 3-12 (assuming you have at least that many CPU cores).

So your current code would probably result in a model larger than RAM, training single-thread slowly and 1000s of times more than necessary, with much of the training happening with a nonsensical negative-alpha which makes the model worse every cycle. If it miraculously didn't MemoryError during model initialization, it'd run for months or years and end up with nonsense results.
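
Putting those points together, a corrected sketch might look roughly like the following. The exact min_count, workers and epochs values here are illustrative assumptions to be tuned for your data and machine, not recommendations, and CorpusStream is the streaming iterable sketched above:

from gensim.models.doc2vec import Doc2Vec

corpus = CorpusStream(mypath)        # streaming iterable instead of an in-memory list

model = Doc2Vec(vector_size=500,
                dm=1,
                min_count=5,         # illustrative; a 72GB corpus likely merits a higher value
                workers=4,           # use several cores; tune for your machine
                epochs=20)           # typical epoch counts are 10-20

model.build_vocab(corpus)

# A single train() call; gensim manages the alpha learning-rate decay internally.
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

model.save(model_name)               # model_name defined elsewhere, as in the question

Leaving the learning-rate schedule to gensim and calling train() exactly once avoids both the negative-alpha problem and the redundant extra passes.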
