
Storing/Loading huge numpy array with less memory

I have a numpy array of shape (20000, 600, 768). I need to store it so that I can later load it back into my code. The main problem is memory usage when loading it back: I have only 16 GB of RAM.

For example, I tried pickle. Once it loads everything, I have almost no memory left to do anything else, especially to train the model.

I also tried writing and loading back with HDF5 (h5py), just a small piece (1000, 600, 768), but it seems to "eat" even more memory.

I also tried CSV. That's just a no-no: it takes far too long to write the data.

I would be grateful for any suggestions on how to store the array so that loading it back doesn't take that much memory.

PS: The data I am storing is a vector representation of texts, which I later use to train my model.

I think there are several things you can do.

First of all, you can change the format in which the data is stored:

  • in a file on disk that can be read iteratively (dumping a Python object to disk is not efficient; you need a better format, for example a text file in which the lines are the rows of the matrix); see the sketch after this list,
  • or in a database. In either case, the goal is to make the data readable in an iterative manner.
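As a minimal sketch of the file-on-disk option (the file name, the float32 dtype, and the smaller stand-in array size are assumptions for illustration): save the array as a .npy file and memory-map it on load, so only the slices you actually index are read into RAM.

    import numpy as np

    # Stand-in for your (20000, 600, 768) array; smaller here only for illustration.
    arr = np.random.rand(1000, 600, 768).astype("float32")
    np.save("embeddings.npy", arr)

    # mmap_mode="r" opens the file without reading it into memory;
    # slicing pages in only the requested rows from disk.
    data = np.load("embeddings.npy", mmap_mode="r")
    batch = data[0:32]
    print(batch.shape, batch.dtype)

An HDF5 dataset read slice by slice behaves similarly; the point is to index pieces of the on-disk array instead of materializing all of it at once.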

Second, and most important, you need to change your algorithm. If you cannot fit all the data in memory, you need methods that work on batches of data instead of the whole dataset.

In machine learning, for example, there are many methods that update the model incrementally with batches of data, as in the sketch below.
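Here is one such sketch (the label file, the batch size, and the choice of scikit-learn's SGDClassifier are assumptions, not part of the original question), feeding the memory-mapped array batch by batch through partial_fit:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    data = np.load("embeddings.npy", mmap_mode="r")   # from the sketch above
    labels = np.load("labels.npy")                    # hypothetical per-sample labels
    classes = np.unique(labels)

    clf = SGDClassifier()
    for start in range(0, data.shape[0], 256):
        stop = min(start + 256, data.shape[0])
        # Only this batch is read from disk; flatten (600, 768) into a feature vector.
        X = np.asarray(data[start:stop]).reshape(stop - start, -1)
        clf.partial_fit(X, labels[start:stop], classes=classes)

The model only ever holds one batch of samples in memory, so the full array never needs to be loaded.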

Third, there are methods to reduce the dimensionality of your training set, for example PCA, feature selection, etc.
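For example, scikit-learn's IncrementalPCA can learn the projection itself in batches (the 128-component target and the batch size below are arbitrary assumptions), so neither the fit nor the reduced training set strains your 16 GB:

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    data = np.load("embeddings.npy", mmap_mode="r")
    n = data.shape[0]

    # Learn a projection from 600*768 = 460800 features down to 128,
    # one batch at a time.
    ipca = IncrementalPCA(n_components=128)
    for start in range(0, n, 256):
        stop = min(start + 256, n)
        if stop - start < ipca.n_components:   # skip a final batch that is too small
            break
        ipca.partial_fit(np.asarray(data[start:stop]).reshape(stop - start, -1))

    # Transform batch-wise as well; the result is only n * 128 floats.
    reduced = np.vstack([
        ipca.transform(np.asarray(data[s:min(s + 256, n)]).reshape(-1, 600 * 768))
        for s in range(0, n, 256)
    ])

At that size the reduced array (20000 x 128 float32 values) is only about 10 MB, versus roughly 35 GB for the original float32 array.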
