How to deal with a large CSV file when training a deep learning model?

I have a huge dataset for training a deep learning model. It's in .csv format and around 2 GB, and right now I'm just loading the entire file into memory with pandas:

df = pd.read_csv('test.csv')

and then passing everything to the Keras model and training it like below:

model.fit(df, targets)

I want to know what other options I have when dealing with even larger datasets, say around 10 GB. I don't have enough RAM to load everything into memory and pass it to the model.

One way I could think of is to somehow draw a random sample/subset of the data from the .csv file and feed it in via a data generator, but the problem is that I couldn't find any way to read a subset/sample of a CSV file without loading everything into memory.

How can I train the model without loading everything into memory? It's okay if your solution uses some memory. Just let me know.

I haven't used this functionality before, but maybe something like:

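import pandas as pd
from tensorflow.keras.utils import Sequence

class CsvSequence(Sequence):
    def __init__(self, batchnames):
        super().__init__()
        self.batchnames = batchnames

    def __len__(self):
        # one batch per pre-split file pair
        return len(self.batchnames)

    def __getitem__(self, i):
        # read only the i-th pair of files from disk
        name = self.batchnames[i]
        X = pd.read_csv(name + '-X.csv').to_numpy()
        Y = pd.read_csv(name + '-Y.csv').to_numpy()
        return X, Y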

would work. You'd need to preprocess your data by splitting your 10 GB file up into, e.g., 10 smaller files. The Unix split utility might be enough if your CSV files have one record per line (most do), though note that only the first piece would then carry the header row.
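If split is not available, or the features and targets still need separating, the preprocessing step could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper name split_csv, its parameters, and the idea that the target lives in a single named column are all hypothetical and not from the original post. It streams the file row by row with the standard csv module, so only one row is held in memory at a time, and it writes the name-X.csv / name-Y.csv pairs that CsvSequence expects.

```python
import csv

def split_csv(path, prefix, rows_per_file, target_col):
    """Split `path` into prefix-1-X.csv / prefix-1-Y.csv, prefix-2-..., etc.

    Hypothetical helper: assumes a header row and a single target column.
    Returns the list of batch names to hand to CsvSequence.
    """
    names = []
    fx = fy = wx = wy = None
    with open(path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)
        t = header.index(target_col)
        x_header = header[:t] + header[t + 1:]
        for n, row in enumerate(reader):
            if n % rows_per_file == 0:
                # close the previous pair of files and start a new one
                if fx is not None:
                    fx.close()
                    fy.close()
                name = f'{prefix}-{len(names) + 1}'
                names.append(name)
                fx = open(name + '-X.csv', 'w', newline='')
                fy = open(name + '-Y.csv', 'w', newline='')
                wx, wy = csv.writer(fx), csv.writer(fy)
                # repeat the header so every piece is a valid CSV on its own
                wx.writerow(x_header)
                wy.writerow([target_col])
            wx.writerow(row[:t] + row[t + 1:])
            wy.writerow([row[t]])
    if fx is not None:
        fx.close()
        fy.close()
    return names
```

The returned list can be passed straight to CsvSequence. rows_per_file then acts as the batch size, so it should be chosen so that one file pair fits comfortably in memory.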

As an incomplete example of how to use this:

seq = CsvSequence([
  'data-1', 'data-2', 'data-3'])

# fit_generator is deprecated in TensorFlow 2.x; fit accepts a Sequence directly
model.fit(seq)

But note that you'd quickly want to do something more efficient: the above causes every CSV file to be re-read and re-parsed on each epoch. It wouldn't surprise me if this loading took more time than everything else put together.

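One suggestion would be to preprocess the files before training, saving them to NumPy binary files. The binary files could then be mmapped in while loading (np.load takes an mmap_mode argument), which is much more efficient: no text parsing is needed, and only the rows you actually touch are read from disk.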
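A minimal sketch of that idea, assuming an all-numeric CSV with a header row and no blank lines. The helper names csv_to_npy and load_batch are made up for illustration. np.lib.format.open_memmap creates the .npy file on disk and returns a writable memory map, so the full array never has to fit in RAM during the conversion either.

```python
import csv
import numpy as np

def csv_to_npy(csv_path, npy_path, dtype=np.float32):
    """One-off conversion of a numeric CSV to a .npy file (hypothetical helper)."""
    # First pass: count rows and columns so the output can be preallocated.
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        n_cols = len(next(reader))        # header row
        n_rows = sum(1 for _ in reader)   # data rows
    # Create the .npy file on disk and get a writable memmap onto it.
    out = np.lib.format.open_memmap(
        npy_path, mode='w+', dtype=dtype, shape=(n_rows, n_cols))
    # Second pass: parse and store one row at a time.
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for i, row in enumerate(reader):
            out[i] = [float(v) for v in row]
    out.flush()
    del out  # release the memory map
    return n_rows, n_cols

def load_batch(npy_path, i, batch_size):
    """Return the i-th batch; only the requested rows are read from disk."""
    data = np.load(npy_path, mmap_mode='r')
    # np.asarray copies the slice into an ordinary in-memory array
    return np.asarray(data[i * batch_size:(i + 1) * batch_size])
```

Something like load_batch could then replace the pd.read_csv calls inside __getitem__. The row-by-row Python loop in the conversion is slow for a 10 GB file, but it only runs once; reading in blocks with pd.read_csv(..., chunksize=...) would likely be faster.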

