
Large, sparse list of lists giving MemoryError when calling np.array(data)

I have a large matrix of 0s and 1s that is mostly 0s. It is initially stored as a list of 25 thousand other lists, each of which is about 2000 ints long.

I am trying to put these into a numpy array, which is what another piece of my program takes. So I run training_data = np.array(data), but this raises a MemoryError.

Why is this happening? I'm assuming it is too much memory for the program to handle (which is surprising to me), but if so, is there a better way of doing this?

A (short) integer takes two bytes to store. You want 25,000 lists, each with 2,000 integers; that gives

25000*2000*2/1000000 = 100 MB
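A quick way to check that arithmetic is to ask NumPy for the array's byte count directly. Note that the two-byte figure assumes a short integer (int16); NumPy's default int is typically 8 bytes on 64-bit platforms, so the same matrix at the default dtype is four times larger (a minimal sketch):

```python
import numpy as np

# 25,000 rows x 2,000 columns of zeros at two different element widths
a16 = np.zeros((25000, 2000), dtype=np.int16)  # 2 bytes per element
a64 = np.zeros((25000, 2000), dtype=np.int64)  # 8 bytes per element

print(a16.nbytes // 10**6)  # 100 (MB)
print(a64.nbytes // 10**6)  # 400 (MB)
```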

This works fine on my computer (4 GB RAM):

>>> import numpy as np
>>> x = np.zeros((25000,2000),dtype=int)

Are you able to instantiate the above matrix of zeros?

Are you reading the file into a Python list of lists and then converting that to a numpy array? That's a bad idea; it will at least double the memory requirements. What is the file format of your data?
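One way to avoid that doubled footprint is to preallocate the array and fill it row by row, so each Python row list can be freed as soon as it is copied in. A sketch, assuming the rows can be produced one at a time (make_row here is a hypothetical stand-in for whatever generates each row):

```python
import numpy as np

n_rows, n_cols = 25000, 2000
# 1 byte per element is plenty for 0/1 data
training_data = np.zeros((n_rows, n_cols), dtype=np.int8)

def make_row(i):
    # hypothetical stand-in for the code that produces each row
    row = [0] * n_cols
    row[i % n_cols] = 1
    return row

for i in range(n_rows):
    # the row list is converted into the array, then garbage-collected
    training_data[i, :] = make_row(i)
```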

For sparse matrices, scipy.sparse provides various alternative datatypes which will be much more memory-efficient.
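A minimal sketch, assuming the data really is mostly zeros: lil_matrix is convenient for incremental construction, and converting to csr_matrix afterwards gives a compact format for arithmetic. Only the nonzero entries are stored:

```python
import numpy as np
from scipy.sparse import lil_matrix

m = lil_matrix((25000, 2000), dtype=np.int8)
m[0, 5] = 1        # set a couple of illustrative entries
m[12000, 300] = 1

csr = m.tocsr()    # compressed sparse row format for fast row ops
print(csr.nnz)     # 2 -- only the nonzeros are stored
```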


EDIT: responding to the OP's comment.

I have 25000 instances of some other class, each of which returns a list of length about 2000. I want to put all of these returned lists into the np.array.

Well, you're somehow going over 8 GB! To solve this, don't do all this manipulation in memory. Write the data to disk one class at a time, then delete the instances and read the file in with numpy.

First do

import csv

with open(..., "w", newline="") as f:
    writer = csv.writer(f)
    for instance in instances:
        writer.writerow(instance.data)

This will write all your data into a large-ish CSV file. Then, you can just use np.loadtxt:

np.loadtxt(..., delimiter=",")
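np.loadtxt accepts a filename directly, and passing a small dtype at load time keeps the footprint down. A round-trip sketch with a hypothetical path ("data.csv") and toy rows standing in for instance.data:

```python
import csv
import numpy as np

rows = [[0, 1, 0], [1, 0, 0]]  # toy stand-in for the real instance data

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

loaded = np.loadtxt("data.csv", delimiter=",", dtype=np.int8)
print(loaded.shape)  # (2, 3)
```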

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 