Working on Google Colab with Python: garbage collector is not working?

I am working on Google Colab using Python, with 12 GB of RAM. I am trying to use Google's pre-trained word2vec model to represent sentences as vectors. The vectors should all have the same length even when the sentences have different numbers of words, so I use padding (the maximum sentence length here is my variable max). The problem is that every time I try to build a matrix containing all of my vectors, I quickly run out of RAM (at about the 20k-th of 128k vectors).

This is my code:

import gc
import numpy as np

final_x_train = []
l = np.zeros((max, 300))  # a Google pre-trained word vector has 300 dimensions
for i in new_X_train:
    buildWordVector(final_x_train, i, model, l)
    gc.collect()  # doesn't do anything except slow the run time down


def buildWordVector(new_X, sent, model, l):
    for x in range(len(sent)):
        try:
            l[x] = list(model[sent[x]])
            gc.collect()  # doesn't do anything except slow the run time down
        except KeyError:
            continue
    new_X.append([list(x) for x in l])

All the variables that I have:

     df:  16.8MiB
     new_X_train: 1019.1KiB
     X_train: 975.5KiB
     y_train: 975.5KiB
     new_X_test: 247.7KiB
     X_test: 243.9KiB
     y_test: 243.9KiB
     l: 124.3KiB
     final_x_train:  76.0KiB
     stop_words:   8.2KiB

But I am at 12 GB / 12 GB of RAM and the session has expired.

As you can see, the garbage collector is not doing anything, apparently because it cannot see the variables. I really need a solution to this problem; can anyone help me, please?

In general in a garbage-collected language like Python you don't need to explicitly request garbage-collection: it happens automatically when you've stopped retaining references (variables/transitive-property-references) to objects.
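For example, a minimal sketch (the sizes are just for illustration): once the last reference to a large object goes away, CPython frees it right away through reference counting, with no explicit gc.collect() call:

import numpy as np

big = np.zeros((10_000, 10_000))  # roughly 0.8 GB of float64 zeros
total = big.sum()                 # use the array
del big                           # last reference dropped: the memory is freed
                                  # immediately by reference counting; no
                                  # explicit gc.collect() is needed
print(total)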

So, if you're getting a memory error here, it's almost certainly because you're really trying to use more than the available amount of memory at a time.

Your code is a bit incomplete and unclear – what is max? what is new_X_train? where are you getting those memory sizing estimates? etc.

But notably: it's not typical to represent a sentence as a concatenation of each word's vector. (So that, with 300d word-vectors, and an up-to-10-word sentence, you have a 3000d sentence-vector.) It's far more common to average the word-vectors together, so both words and sentences have the same size, and there's no blank padding at the end of short sentences.

(That's still a very crude way to create text-vectors, but more common than padding-to-maximum-sentence-size.)
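Here's a minimal sketch of the averaging approach, assuming model is a gensim KeyedVectors-style lookup of 300-dimensional word-vectors (the helper name sentence_vector is just for illustration):

import numpy as np

def sentence_vector(sent, model, dim=300):
    # Average the vectors of the in-vocabulary words; fall back to a
    # zero vector if none of the words are in the model.
    vecs = [model[w] for w in sent if w in model]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

final_x_train = np.array([sentence_vector(s, model) for s in new_X_train])

Each sentence-vector is 300 numbers regardless of sentence length, so 128k of them take roughly 128,000 × 300 × 8 bytes ≈ 0.3 GB as a float64 array, well within 12 GB.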
