
Run out of memory when saving list with numpy

I have a fairly large list of lists representing the tokens in the Sogou text classification data set. I can process the entire training set of 450,000 documents with 12 GB of RAM left over, but when I call numpy.save() on the list of lists, memory usage seems to double and I run out of memory.

Why is this? Does numpy.save convert the list before saving but retain the list, thus using more memory?

Is there an alternative way to save this list of lists, i.e. pickling? I believe numpy.save uses the pickle protocol, judging from the allow_pickle argument: https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html

print "Collecting Raw Documents, tokenize, and remove stop words"
df = pd.read_pickle(path + dataSetName + "Train")
frequency = defaultdict(int)

gen_docs = []
totalArts = len(df)
for artNum in range(totalArts):
    if artNum % 2500 == 0:
        print "Gen Docs Creation on " + str(artNum) + " of " + str(totalArts)
    bodyText = df.loc[artNum,"fullContent"]
    bodyText = re.sub('<[^<]+?>', '', str(bodyText))
    bodyText = re.sub(pun, " ", str(bodyText))
    tmpDoc = []
    for w in word_tokenize(bodyText):
        w = w.lower().decode("utf-8", errors="ignore")
        #if w not in STOPWORDS and len(w) > 1:
        if len(w) > 1:
            #w = wordnet_lemmatizer.lemmatize(w)
            w = re.sub(num, "number", w)
            tmpDoc.append(w)
            frequency[w] += 1
    gen_docs.append(tmpDoc)
print len(gen_docs)

del df
print "Saving unfiltered gen"
dataSetName = path + dataSetName
np.save("%s_lemmaWords_noStop_subbedNums.npy" % dataSetName, gen_docs)

np.save first tries to convert the input into an array. After all, it is designed to save numpy arrays.

If the resulting array is multidimensional with numeric or string values (dtype), it saves some basic dimension information plus a memory copy of the array's data buffer.

But if the array contains other objects (e.g. dtype object), then it pickles those objects and saves the resulting string(s).
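To see the two paths side by side, here is a minimal sketch (the file names are placeholders, not from the question):

import numpy as np

# A rectangular numeric array: np.save writes a small header plus a raw
# copy of the data buffer; no pickling is involved.
numeric = np.arange(6).reshape(2, 3)
np.save("numeric_demo.npy", numeric)

# A ragged list of lists becomes a 1d object array, so np.save falls
# back to pickling each element (newer numpy versions require
# allow_pickle=True to load it back).
ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)
np.save("ragged_demo.npy", ragged)
back = np.load("ragged_demo.npy", allow_pickle=True)
print(back.dtype)   # object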

I would try

arr = np.array(gen_docs)

Does that produce a memory error?

If not, what are its shape and dtype?

If the tmpDoc sublists vary in length, arr will be a 1d array with object dtype, those objects being the tmpDoc lists.

Only if all the tmpDoc lists have the same length will it produce a 2d array. Even then, the dtype will depend on the elements, whether numbers, strings, or other objects.
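A quick way to check both cases, as a sketch:

import numpy as np

uniform = [["a", "b"], ["c", "d"]]   # sublists of equal length
ragged = [["a", "b"], ["c"]]         # sublists of different lengths

u = np.array(uniform)
print(u.shape)   # (2, 2) -> a 2d array of strings
r = np.array(ragged, dtype=object)   # dtype=object is required on newer numpy
print(r.shape)   # (2,)   -> a 1d object array holding the lists
print(r.dtype)   # object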

I might add that such an object array, when saved, is pickled via that same protocol.
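So if the goal is simply to persist gen_docs, pickling the list directly sidesteps the intermediate object array entirely; a minimal sketch (the file name here is an assumption):

import pickle

# Stream the list of lists straight to disk without first building an
# object array, avoiding the extra in-memory copy that np.save makes.
with open("gen_docs.pkl", "wb") as f:
    pickle.dump(gen_docs, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("gen_docs.pkl", "rb") as f:
    gen_docs = pickle.load(f)

Because pickle.dump writes incrementally to the open file, it does not need to materialize a second full copy of the data the way converting to an array first would.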
