
Working on Google Colab with Python: garbage collector is not working?

I am working on Google Colab using Python and I have 12 GB of RAM. I am trying to use word2vec pre-trained by Google to represent sentences as vectors. The vectors must all have the same length even if the sentences don't have the same number of words, so I used padding (the maximum length of a sentence here is my variable max). The problem is that every time I try to create a matrix containing all of my vectors, I quickly run out of RAM (around the 20,000th of 128,000 vectors).

This is my code:

import gc
import numpy as np

def buildWordVector(new_X, sent, model, l):
    # Fill the pre-allocated (max, 300) buffer with this sentence's word vectors,
    # skipping words that are not in the model's vocabulary.
    for x in range(len(sent)):
        try:
            l[x] = list(model[sent[x]])
            gc.collect()  # doesn't do anything except slow the run time
        except KeyError:
            continue
    new_X.append([list(x) for x in l])

final_x_train = []
l = np.zeros((max, 300))  # the length of a Google pre-trained word vector is 300
for i in new_X_train:
    buildWordVector(final_x_train, i, model, l)
    gc.collect()  # doesn't do anything except slow the run time

All the variables that I have:

     df:  16.8MiB
     new_X_train: 1019.1KiB
     X_train: 975.5KiB
     y_train: 975.5KiB
     new_X_test: 247.7KiB
     X_test: 243.9KiB
     y_test: 243.9KiB
     l: 124.3KiB
     final_x_train:  76.0KiB
     stop_words:   8.2KiB

But I am at 12 GB / 12 GB of RAM and the session has expired.

As you can see, the garbage collector is not doing anything, apparently because it cannot see the variables. I really need a solution to this problem. Can anyone help me, please?

In general, in a garbage-collected language like Python, you don't need to explicitly request garbage collection: it happens automatically once you've stopped retaining references (variables, or objects transitively reachable from them) to the data.
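As a small illustration (not from the question): CPython releases an object as soon as the last reference to it disappears, and gc.collect() only matters for reference cycles.

import gc
import sys

# Hypothetical example: a large list is freed the moment its last reference goes away.
big = [0.0] * 10_000_000          # ~80 MB just for the list's internal pointers
print(sys.getsizeof(big))         # the list object itself is huge

big = None                        # last reference dropped -> memory is released
gc.collect()                      # only useful for reference *cycles*; here it's a no-op
# If some other name (or a list you keep appending to) still points at the data,
# no amount of gc.collect() calls will free it.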

So, if you're getting a memory error here, it's almost certainly because you're really trying to use more than the available amount of memory at a time.

Your code is a bit incomplete and unclear – what is max? What is new_X_train? Where are you getting those memory-size estimates? etc.
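Still, a rough back-of-the-envelope sketch shows where the RAM goes. The real max isn't shown, so 50 words per padded sentence is just an assumed placeholder; the key point is that the list(...) conversions in your code store every value as a separate Python float object rather than as a compact numpy array.

n_sentences = 128_000   # "128k vectors" mentioned in the question
max_len = 50            # hypothetical padded sentence length (the real max is unknown)
dims = 300              # dimensionality of the Google News word2vec vectors

numpy_bytes = n_sentences * max_len * dims * 4          # one float32 array: ~7.7 GB
python_bytes = n_sentences * max_len * dims * (24 + 8)  # Python float objects + list slots: ~61 GB
print(numpy_bytes / 1e9, python_bytes / 1e9)

Under assumptions like these, the nested-list representation alone would need several times your 12 GB, which is roughly consistent with the session dying around the 20,000th sentence.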

But notably: it's not typical to represent a sentence as a concatenation of each word's vector. (So that, with 300d word-vectors, and an up-to-10-word sentence, you have a 3000d sentence-vector.) It's far more common to average the word-vectors together, so both words and sentences have the same size, and there's no blank padding at the end of short sentences.

(That's still a very crude way to create text-vectors, but more common than padding-to-maximum-sentence-size.)
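Here's a minimal sketch of that averaging approach, reusing the names model and new_X_train from the question and assuming model is a gensim KeyedVectors loaded from the Google News file:

import numpy as np

def sentence_vector(sent, model, dims=300):
    # Average the 300-d vectors of the words the model knows; unknown words are skipped.
    vecs = [model[word] for word in sent if word in model]
    if not vecs:
        return np.zeros(dims, dtype=np.float32)
    return np.mean(vecs, axis=0)

# One 300-d row per sentence: 128k x 300 float32 is only about 150 MB.
final_x_train = np.vstack([sentence_vector(sent, model) for sent in new_X_train])

Since every sentence collapses to a single 300-d row, there's no padding buffer kept alive inside final_x_train, and no gc.collect() calls are needed at all.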
