
Reading a large dataset from an HDF5 file into x_train and using it in a Keras model

I have a large HDF5 file containing 16000 different 512x512 NumPy arrays. Obviously, reading the whole file into RAM will make it crash (the total size of the file is 40 GB).

I want to load this array into data and then split data into train_x and test_x. The labels are stored locally.

I did this, which only creates a reference to the file without fetching the data:

    h5 = h5py.File('/file.hdf5', 'r')
    data = h5.get('data')

but when I try to split data into train and test:

    x_train= data[0:14000]
    y_train= label[0:16000]
    x_test= data[14000:]
    y_test= label[14000:16000]

I get the error:

    MemoryError: Unable to allocate 13.42 GiB for an array with shape (14000, 256, 256) and data type float32

I want to load them in batches and train a Keras model, but the error above obviously prevents me from doing this:

    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train,
                        validation_data=(x_test, y_test),
                        epochs=32, verbose=1)

How can I get around this issue?

First, let's describe what you are doing.
This statement returns an h5py object for the dataset named 'data': data = h5.get('data'). It does NOT load the entire dataset into memory (which is good). Note: that statement is more typically written as data = h5['data']. Also, I assume there is a similar call to get an h5py object for the 'label' dataset.
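
For reference, a minimal sketch of that lazy access pattern (assuming the dataset is stored under the name 'data' and the path from your code) looks like this; the shape and dtype come from the file's metadata, so nothing large is read into memory:

    import h5py

    # Open the file read-only and get a lazy handle to the dataset.
    h5 = h5py.File('/file.hdf5', 'r')
    data = h5['data']              # same as h5.get('data'); returns an h5py Dataset, not an array
    print(data.shape, data.dtype)  # metadata only, e.g. (16000, 256, 256) float32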

Each of your next 4 statements will return a NumPy array based on the indices and the dataset. NumPy arrays are stored in memory, which is why you get the memory error. When the program executes x_train= data[0:14000], you need 13.42 GiB to load that array into memory. (Note: the error implies the arrays are 256x256, not 512x512.)
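
As a concrete illustration (file path and dataset name taken from the question), reading a single image at a time stays small, while slicing the first 14000 rows has to be materialized as one in-memory array:

    import h5py

    with h5py.File('/file.hdf5', 'r') as h5:
        data = h5['data']
        one_image = data[0]                  # reads one image; for 256x256 float32 that is ~256 KiB
        print(type(one_image), one_image.nbytes)
        # x_train = data[0:14000] would try to allocate the whole slice as a single
        # NumPy array in RAM -- that one allocation is what raises the MemoryError.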

If you don't have enough RAM to store the array, you will have to "do something" to reduce the memory footprint. Options to consider:

  1. Resize the images from 256x256 (or 512x512) to something smaller and save them in a new h5 file
  2. Modify 'data' to use ints instead of floats and save it in a new h5 file
  3. Write the image data to .npy files and load them in batches
  4. Read in fewer images at a time and train in batches (see the sketch after this list)
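
For option 4, one common pattern is a batch generator that reads only one batch from the HDF5 file per training step. Below is a minimal sketch using tf.keras.utils.Sequence; the file path, dataset name 'data', label array, and the 14000/2000 split are assumptions mirrored from the question, not a tested drop-in for your data:

    import numpy as np
    import h5py
    from tensorflow import keras

    class HDF5Sequence(keras.utils.Sequence):
        """Yields (x, y) batches read on demand from an HDF5 dataset."""

        def __init__(self, h5_path, dataset_name, labels, indices, batch_size=32):
            super().__init__()
            self.labels = np.asarray(labels)        # full label array (length 16000 here)
            self.indices = np.asarray(indices)      # which rows belong to this split
            self.batch_size = batch_size
            self.h5 = h5py.File(h5_path, 'r')       # lazy handle, nothing read yet
            self.data = self.h5[dataset_name]

        def __len__(self):
            # Number of batches per epoch.
            return int(np.ceil(len(self.indices) / self.batch_size))

        def __getitem__(self, idx):
            batch_idx = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
            batch_idx = np.sort(batch_idx)          # h5py fancy indexing needs increasing order
            x = self.data[batch_idx]                # only this batch is read into RAM
            y = self.labels[batch_idx]
            return x, y

    # Hypothetical usage, mirroring the question's 14000/2000 split:
    # labels = ...  # your locally stored labels, length 16000
    # train_seq = HDF5Sequence('/file.hdf5', 'data', labels, np.arange(0, 14000))
    # test_seq  = HDF5Sequence('/file.hdf5', 'data', labels, np.arange(14000, 16000))
    # history = model.fit(train_seq, validation_data=test_seq, epochs=32, verbose=1)

Keras pulls batches through __getitem__ one at a time, so peak memory stays at roughly one batch of images instead of the full 14000-image array.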

I wrote an answer to a somewhat related question that describes h5py behavior with training and testing data, and how to randomize input from .npy files. It might be helpful. See this answer: h5py writing: How to efficiently write millions of .npy arrays to a .hdf5 file?

As an aside, you probably want to randomize your selection of testing and training data (and not simply pick the first 14000 images for training and the last 2000 images for testing). Also, check your indices for y_train= label[0:16000]. I think you will get an error from mismatched x_train and y_train sizes.
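
If you go the generator route, a shuffled split can be expressed as two index arrays; a small sketch (the 16000/14000 counts come from the question, the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(42)         # fixed seed so the split is reproducible
    perm = rng.permutation(16000)           # shuffle all image indices once
    train_idx = perm[:14000]
    test_idx = perm[14000:]
    # These arrays can be passed as `indices` to the HDF5Sequence sketch above; the
    # np.sort in __getitem__ keeps h5py's increasing-index requirement satisfied.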
