
Numpy array taking too much memory

I am loading a CSV file via numpy.loadtxt into a numpy array. My data has about 1 million records and 87 columns. While object.nbytes is only 177159666 bytes, it actually takes much more memory: I get a 'MemoryError' while training a decision tree with scikit-learn. Also, after reading the data, the available memory on my system dropped by 1.8 GB. I am working on a Linux machine with 3 GB of memory. So does object.nbytes return the real memory usage of a numpy array?

train = np.loadtxt('~/Py_train.csv', delimiter=',', skiprows=1, dtype='float16')
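
For reference, object.nbytes only counts the final array buffer, i.e. rows × columns × itemsize. With float16 (2 bytes per value) the reported figure works out exactly if the file has about 1,018,159 data rows, a count inferred here from nbytes rather than stated in the question:

import numpy as np

# nbytes is just the size of the array buffer itself
rows, cols = 1018159, 87                    # row count inferred from the reported nbytes
itemsize = np.dtype('float16').itemsize     # 2 bytes per value
print(rows * cols * itemsize)               # 177159666, matching object.nbytes

The 1.8 GB drop in free memory largely comes from the temporary Python objects loadtxt creates while parsing the text, not from the final array.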

I had a similar problem when trying to create a large 400,000 x 100,000 matrix. Fitting all of that data into an ndarray is impossible.

However, the key insight was that most of the values in the matrix are empty, so the data can be represented as a sparse matrix. Sparse matrices are useful because they store only the non-empty values and therefore take far less memory. I used scipy.sparse's sparse matrix implementation, and I was able to fit this large matrix in memory.

Here is my implementation:

https://github.com/paolodm/Kaggle/blob/master/mdschallenge/buildmatrix.py
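
A minimal sketch of the same idea (not the linked script; the shape and values here are illustrative):

import numpy as np
from scipy.sparse import lil_matrix

# A dense 400,000 x 100,000 float64 matrix would need ~320 GB;
# a sparse matrix stores only the entries that are actually set.
mat = lil_matrix((400000, 100000), dtype=np.float64)
mat[0, 5] = 1.0
mat[123, 99999] = 2.5

csr = mat.tocsr()                  # CSR format for fast arithmetic and row slicing
print(csr.nnz, csr.data.nbytes)    # memory scales with the non-zeros, not the shape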

You will probably get better performance by using numpy.fromiter:

In [30]: numpy.fromiter((tuple(row) for row in csv.reader(open('/tmp/data.csv'))), dtype='i4,i4,i4')
Out[30]: 
array([(1, 2, 3), (4, 5, 6)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

where

$ cat /tmp/data.csv 
1,2,3
4,5,6
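
Applied to the 87-column file from the question, the same approach could look roughly like this (a sketch; the shortened path and the per-row float conversion are my assumptions):

import csv
import numpy as np

# Stream the CSV through fromiter so each row is converted on the fly
# instead of being buffered as a Python list of strings first.
cols = 87
row_dtype = ','.join(['f2'] * cols)        # one float16 field per column, as in the question
with open('Py_train.csv') as f:
    reader = csv.reader(f)
    next(reader)                           # skip the header row
    train = np.fromiter((tuple(float(x) for x in row) for row in reader),
                        dtype=row_dtype)

The result is a 1-D structured array with one field per column; numpy.lib.recfunctions.structured_to_unstructured can turn it into the plain 2-D array scikit-learn expects.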

Alternatively, I strongly suggest you use pandas: it is built on top of numpy and has many utility functions for statistical analysis.
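
For example (a sketch, assuming a reasonably recent pandas; the filename follows the question):

import pandas as pd

# read_csv parses in C and lets you request a compact dtype up front,
# which keeps peak memory well below numpy.loadtxt's.
df = pd.read_csv('Py_train.csv', dtype='float32')
train = df.to_numpy()              # plain ndarray to hand to scikit-learn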

I just had the same problem:

My saved .npy file is 752 MB on disk, and arr.nbytes = 701289568 (~669 MB); but np.load takes 2.7 GB of memory, i.e. about 4x the actual memory needed.

https://github.com/numpy/numpy/issues/17461

and it turns out:

the data array contains a mix of a small number of strings and a large amount of numbers, so numpy stores it with dtype=object: each element of the array is just an 8-byte pointer.

Each of those 8-byte slots points to a Python object, and that object takes at least 24 bytes plus the space for the number or the string itself.

So in memory, each value costs roughly an 8-byte pointer plus a 24+ byte Python object, i.e. about 4x the 8 bytes that a plain double occupies in the file.

NOTE: np.save() and np.load() are not symmetric here:

-- np.save() writes the numeric values as raw scalar data, so the file size on disk matches the data size the user has in mind, and is small

-- np.load() recreates each value as a PyObject (because the dtype is object), inflating memory usage to roughly 4x what the user expected.

The same applies to other file formats, e.g. CSV files.

Conclusion: do not mix types (strings stored as np.object_ and numbers) in a numpy array. Use a homogeneous numeric dtype, e.g. np.double; then the array will take about the same space in memory as the dumped file on disk.
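
A small illustration of the overhead (a sketch; exact byte counts depend on the Python build):

import sys
import numpy as np

# nbytes only counts the array buffer. For a float64 array that is the whole
# story; for an object array it counts just the 8-byte pointers, not the
# boxed Python floats they point to.
floats = np.arange(1000000, dtype=np.float64)
objs = floats.astype(object)

print(floats.nbytes)    # 8000000: the doubles themselves
print(objs.nbytes)      # 8000000: pointers only
print(objs.nbytes + sum(sys.getsizeof(x) for x in objs))    # ~32000000: pointers plus boxed floats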
