
What is the best practice to train a model on a BIG dataset?

I need to train a model on a dataset that requires more memory than my GPU has. What is the best practice for feeding the dataset to the model?

Here are my steps:

  1. First of all, I load the dataset using batch_size
BATCH_SIZE=32

builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
  2. In the second step I prepare the data
for record in raw_train_ds.take(1):  # take(1) yields only the first batch
    train_images, train_labels = record['image'], record['label']
    print(train_images.shape)
    train_images = train_images.numpy().astype(np.float32) / 255.0
    train_labels = tf.keras.utils.to_categorical(train_labels)
  3. And then I feed the data to the model
history = model.fit(train_images, train_labels, epochs=NUM_EPOCHS, validation_split=0.2)

But at step 2 I only prepare the data for the first batch and miss the rest of the batches, because model.fit is outside the loop scope (and, as I understand it, works only on that first batch).
On the other hand, I can't just remove take(1) and move model.fit inside the loop. Yes, in that case I would handle all batches, but then model.fit would be called at the end of every iteration, and that also would not work properly.

So, how should I change my code to work properly with a big dataset using model.fit? Could you point me to an article, any documentation, or just advise how to deal with it? Thanks.

Update: In my post below (Approach 1) I describe one approach to solving the problem. Are there any other, better approaches, or is this the only way to solve it?

You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:

  • A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).

So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:

BATCH_SIZE=32

builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
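
As a side note, when the dataset does not fit in memory it can also help to add shuffling and prefetching to the tf.data pipeline. A minimal sketch with placeholder buffer sizes (note that shuffle here operates on whole batches, since as_dataset already batched the data):

train_dataset_fit = (raw_train_ds
    .map(lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
    .shuffle(1000)                              # shuffle a bounded buffer of batches, not the whole dataset
    .prefetch(tf.data.experimental.AUTOTUNE))   # overlap input preparation with training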

One problem with this is that it does not support a validation_split parameter but, as shown in this guide, tfds already gives you the functionality to have the splits of the data. So you would just need to get the test split dataset, transform it as above and pass it as validation_data to fit.
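
A minimal sketch of that, assuming the 'test' split of MNIST is used as the validation set and reusing the same mapping as above (raw_test_ds and test_dataset_fit are just illustrative names):

raw_test_ds = builder.as_dataset(split='test', batch_size=BATCH_SIZE)
test_dataset_fit = raw_test_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS,
                    validation_data=test_dataset_fit)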

Approach 1

Thanks @jdehesa, I changed my code:

  1. Load the dataset - in reality, it doesn't load the data into memory until the first call to 'next' on the dataset iterator, and even then I think the iterator loads only a portion of the data (a batch) with a size equal to BATCH_SIZE
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[90%:]"], batch_size=BATCH_SIZE)
  2. Collected all required transformations into one method
def prepare_data(x):
    train_images, train_labels = x['image'], x['label']
    # TODO: resize image
    train_images = tf.cast(train_images, tf.float32) / 255.0
    # train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=NUM_CLASSES)
    train_labels = tf.one_hot(train_labels, NUM_CLASSES)
    return (train_images, train_labels)
  3. Applied these transformations to each element in the batched dataset using the method tf.data.Dataset.map
train_dataset_fit = raw_train_ds.map(prepare_data)
  4. And then fed this dataset into model.fit - as I understand it, model.fit will iterate through all the batches in the dataset (see the validation sketch after this list)
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
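
A minimal sketch of hooking up the validation split created in step 1 (not part of the original code; validation_dataset_fit is just an illustrative name): map raw_validation_ds with the same prepare_data and pass it to fit via validation_data:

validation_dataset_fit = raw_validation_ds.map(prepare_data)
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS,
                    validation_data=validation_dataset_fit)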
