Numpy array taking too much memory

Question

I am loading a csv file via numpy.loadtxt into a numpy array. My data has about 1 million records and 87 columns. While object.nbytes is only 177159666 bytes, the array actually takes much more memory, because I get a 'MemoryError' while training a Decision Tree using scikit-learn. Also, after reading the data, the available memory in my system dropped by 1.8 GB. I am working on a Linux machine with 3 GB of memory. So does object.nbytes return the real memory usage of a numpy array?

train = np.loadtxt('~/Py_train.csv', delimiter=',', skiprows=1, dtype='float16')

Answer 1

I had a similar problem when trying to create a large 400,000 x 100,000 matrix. Fitting all of that data into an ndarray is impossible.

However, the big insight I came up with was that most of the values in the matrix are empty, so it can be represented as a sparse matrix. Sparse matrices are useful because they store only the non-empty entries and therefore need far less memory. I used the sparse matrix implementation in scipy.sparse, and I'm able to fit this large matrix in memory.

Here is my implementation:

https://github.com/paolodm/Kaggle/blob/master/mdschallenge/buildmatrix.py
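To make the memory argument concrete, here is a minimal sketch (not the linked implementation) that builds a matrix of the same 400,000 x 100,000 shape with scipy.sparse and compares its footprint with what a dense float32 ndarray would need. The number of non-zero entries, the random positions, and the float32 dtype are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import sparse

# Shape taken from the answer; everything else below is illustrative.
n_rows, n_cols = 400_000, 100_000
n_nonzero = 1_000  # hypothetical count of non-empty cells

rng = np.random.default_rng(0)
rows = rng.integers(0, n_rows, size=n_nonzero)
cols = rng.integers(0, n_cols, size=n_nonzero)
vals = np.ones(n_nonzero, dtype=np.float32)

# Build in COO form (convenient for construction), then convert to CSR,
# which is the usual format for row slicing and arithmetic.
matrix = sparse.coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols)).tocsr()

# A CSR matrix stores three flat arrays: values, column indices, row pointers.
sparse_bytes = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes
# Size a dense float32 array of the same shape would need (never allocated here).
dense_bytes = n_rows * n_cols * np.dtype(np.float32).itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e9:.1f} GB")
```

The saving only holds while the matrix stays mostly empty; a dense dataset like the one in the question would not benefit from this representation.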

Answer 2

You will probably get better performance by using numpy.fromiter:

In [30]: numpy.fromiter((tuple(row) for row in csv.reader(open('/tmp/data.csv'))), dtype='i4,i4,i4')
Out[30]: 
array([(1, 2, 3), (4, 5, 6)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

where

$ cat /tmp/data.csv 
1,2,3
4,5,6

Alternatively, I strongly suggest you use pandas: it's based on numpy and has many utility functions for statistical analysis.
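As a variation on the numpy.fromiter idea, the following sketch streams values from a csv reader straight into a flat float32 array and reshapes it afterwards, so no intermediate list of rows is built. The inline csv text, the header row, the column count, and the helper name iter_values are all made up for the example; a real file would be opened with open() instead of io.StringIO.

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for a csv file with a header row and 3 columns.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
n_cols = 3


def iter_values(handle):
    """Yield every field of the csv as a float, one value at a time."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for row in reader:
        for field in row:
            yield float(field)


# fromiter consumes the generator lazily and fills a 1-D array of the given dtype.
flat = np.fromiter(iter_values(io.StringIO(csv_text)), dtype=np.float32)
data = flat.reshape(-1, n_cols)

print(data.shape, data.dtype, data.nbytes)
```

If the number of values is known in advance, passing it as the count argument of numpy.fromiter lets numpy allocate the output once instead of growing it.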

Answer 3

I just had the same problem:

My saved .npy file is 752M (on disk), and arr.nbytes = 701289568 (~669M); but np.load takes 2.7 GB of memory, i.e. about 4 times the memory actually needed.

https://github.com/numpy/numpy/issues/17461

and it turns out:

the data array contains a mix of strings (a small number) and numbers (a large number).

But each of those 8-byte locations points to a Python object, and that object takes at least 24 bytes plus space for either the number or the string.

So, in memory, each element costs (8-byte pointer + 24 bytes), which is roughly 4 times the mostly 8-byte (double) values in the file.

NOTE: np.save() and np.load() are not symmetric:

-- np.save() saves the numeric values as scalar data, so the file size on disk matches the data size the user has in mind, and is small

-- np.load() loads the numeric values as PyObjects, which inflates memory usage to about 4x what the user expected.

The same applies to other file formats, e.g. csv files.

Conclusion: do not use mixed types (strings stored as np.object alongside np.numbers) in a np array. Use a homogeneous numeric type, e.g. np.double. Then the array will take about the same space in memory as the dumped file on disk.
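The pointer-plus-object arithmetic above can be checked with a small sketch. It assumes CPython, where sys.getsizeof reports the size of each boxed element; the array length and the "n/a" placeholder string are arbitrary. Note that nbytes on an object array counts only the pointers, which is why it under-reports real memory use.

```python
import sys

import numpy as np

n = 1000  # arbitrary length for the illustration

# Homogeneous array: each element is a raw 8-byte double.
numeric = np.arange(n, dtype=np.float64)

# One string forces dtype=object, so every element becomes a boxed Python object.
mixed = numeric.astype(object)
mixed[0] = "n/a"

# nbytes only counts the pointer slots of the object array...
pointer_bytes = mixed.nbytes
# ...while the Python objects they point to live elsewhere on the heap.
payload_bytes = sum(sys.getsizeof(item) for item in mixed)

print("float64 array:", numeric.nbytes, "bytes")
print("object array :", pointer_bytes, "bytes of pointers +",
      payload_bytes, "bytes of Python objects")
```

Converting such data to a single numeric dtype (and keeping the few strings in a separate structure) brings nbytes back in line with the memory actually used.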

3 answers

Answer 1: score 5, posted 2012-08-02 15:15:57

Answer 2: score 3, accepted, posted 2012-08-02 15:18:37

Answer 3: score 0, posted 2020-10-05 21:42:56
