I am working on Google Colab using Python with 12 GB of RAM. I am trying to use Google's pre-trained word2vec model to represent sentences as vectors. All vectors should have the same length even when the sentences do not have the same number of words, so I used padding (the maximum sentence length here is my variable max). The problem is that every time I try to build a matrix containing all of my vectors, I quickly run out of RAM (around the 20,000th of 128,000 vectors).
This is my code:

    import gc
    import numpy as np

    def buildWordVector(new_X, sent, model, l):
        for x in range(len(sent)):
            try:
                l[x] = list(model[sent[x]])
                gc.collect()  # doesn't do anything except slow down the run
            except KeyError:
                continue
        new_X.append([list(x) for x in l])

    final_x_train = []
    l = np.zeros((max, 300))  # the vector length of the Google pre-trained model is 300
    for i in new_X_train:
        buildWordVector(final_x_train, i, model, l)
        gc.collect()  # doesn't do anything except slow down the run
All the variables that I have:
df: 16.8MiB
new_X_train: 1019.1KiB
X_train: 975.5KiB
y_train: 975.5KiB
new_X_test: 247.7KiB
X_test: 243.9KiB
y_test: 243.9KiB
l: 124.3KiB
final_x_train: 76.0KiB
stop_words: 8.2KiB
But I am at 12 GB/12 GB of RAM and the session has crashed.
As you can see, the garbage collector is not doing anything, apparently because it cannot see the variables. I really need a solution to this problem. Can anyone help me, please?
In general in a garbage-collected language like Python you don't need to explicitly request garbage-collection: it happens automatically when you've stopped retaining references (variables/transitive-property-references) to objects.
So, if you're getting a memory error here, it's almost certainly because you're really trying to use more than the available amount of memory at a time.
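A rough back-of-the-envelope estimate shows why. Assuming (hypothetically) the 128,000 sentences from the question and a maximum sentence length of 50 words, even a dense NumPy float64 array of padded sentence matrices already exceeds 12 GB, and nested Python lists are several times larger still:

```python
n_sentences = 128_000  # corpus size from the question
max_len = 50           # assumed value of the `max` variable (not given in the question)
dim = 300              # word2vec vector dimensionality

# Size of one dense float64 array holding every padded sentence matrix:
numpy_bytes = n_sentences * max_len * dim * 8
print(numpy_bytes / 1e9, "GB")  # 15.36 GB, already more than the 12 GB available

# Nested Python lists are far worse: each float is a boxed ~24-byte
# object plus an 8-byte list pointer, roughly 4x the raw array size.
```

So no amount of garbage collection helps: the retained result itself does not fit in memory.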
Your code is a bit incomplete and unclear – what is max? What is new_X_train? Where are you getting those memory-size estimates? Etc.
But notably: it's not typical to represent a sentence as a concatenation of each word's vector. (So that, with 300d word-vectors, and an up-to-10-word sentence, you have a 3000d sentence-vector.) It's far more common to average the word-vectors together, so both words and sentences have the same size, and there's no blank padding at the end of short sentences.
(That's still a very crude way to create text-vectors, but more common than padding-to-maximum-sentence-size.)