
How to split up tf.data.Dataset into x_train, y_train, x_test, y_test for Keras

If I have a dataset

dataset = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    labels="inferred",
    label_mode="int",
    class_names=None,
    color_mode="rgb",
    batch_size=32,
    image_size=(32, 32),
    shuffle=True,
    seed=None,
    validation_split=None,
    subset=None,
    interpolation="bilinear",
    follow_links=False,
)

how do I separate this into x and y arrays? The x array would be the image array, and the y array would hold the category for each image.
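To answer the question directly: image_dataset_from_directory returns a batched tf.data.Dataset, so one way to get plain arrays is to iterate over it once and stack the batches. A minimal sketch, using a small synthetic dataset as a stand-in for the directory-backed one above (note that with shuffle=True the order changes between epochs, so iterate only once if you need x and y to stay aligned):

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the batched dataset returned by
# image_dataset_from_directory: 10 RGB images, 32x32, integer labels.
images = np.random.rand(10, 32, 32, 3).astype("float32")
labels = np.random.randint(0, 2, size=10)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)

# Iterate once over the batched dataset and stack the batches
# back into plain NumPy arrays.
x_batches, y_batches = [], []
for x_batch, y_batch in dataset:
    x_batches.append(x_batch.numpy())
    y_batches.append(y_batch.numpy())

x = np.concatenate(x_batches)  # shape (10, 32, 32, 3)
y = np.concatenate(y_batches)  # shape (10,)
```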

This will do the separation for you. First create a directory, let's call it c:\train. Inside that directory, create one subdirectory per class. For example, if you have images of dogs and images of cats and want to build a classifier that distinguishes between them, create two subdirectories within the train directory: name one cats and the other dogs. Place all the cat images in the cats subdirectory and all the dog images in the dogs subdirectory. Now assume you want to use 75% of the images for training and 25% for validation. Use the code below to create a training set and a validation set.

import tensorflow as tf

train_batch_size = 50  # set the training batch size you desire
valid_batch_size = 50  # set this so that 0.25 * total samples / valid_batch_size is an integer
dir = r'c:\train'
img_size = 224  # set this to the image size you want to use
# With shuffle=True and validation_split set, pass the same seed to both
# calls so the training and validation subsets do not overlap.
seed = 123
train_set = tf.keras.preprocessing.image_dataset_from_directory(
    directory=dir, labels='inferred', label_mode='categorical', class_names=None,
    color_mode='rgb', batch_size=train_batch_size, image_size=(img_size, img_size),
    shuffle=True, seed=seed, validation_split=0.25, subset='training',
    interpolation='nearest', follow_links=False)
valid_set = tf.keras.preprocessing.image_dataset_from_directory(
    directory=dir, labels='inferred', label_mode='categorical', class_names=None,
    color_mode='rgb', batch_size=valid_batch_size, image_size=(img_size, img_size),
    shuffle=False, seed=seed, validation_split=0.25, subset='validation',
    interpolation='nearest', follow_links=False)

With labels='inferred', the labels are the names of the subdirectories; in the example they would be cats and dogs. With label_mode='categorical', the labels are one-hot vectors, so when you compile your model set loss='categorical_crossentropy'. Note that shuffle is set to True for the training set but False for the validation set. When you build your model, the top layer should have 2 nodes and the activation should be softmax. When you use model.fit to train your model, it is desirable to go through your validation set once per epoch. So say in the dog-cat example you have 1000 dog images and 1000 cat images, for a total of 2000: 75% = 1500 will be used for training and 500 for validation. If you set valid_batch_size=50, it will take 10 steps to go through all the validation images once per epoch. Similarly, if train_batch_size=50, it will take 30 steps to go through the training set. When you run model.fit, set steps_per_epoch=30 and validation_steps=10.

Actually, I prefer to use tf.keras.preprocessing.image.ImageDataGenerator for generating data sets. It is similar but more versatile; the documentation is here. I like it because it lets you specify a pre-processing function if you wish and also lets you rescale your image values. Typically you want to use 1/255 as the rescale value.
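The step arithmetic above can be sketched as a few lines, so you can plug in your own counts and batch sizes:

```python
# Worked numbers for the dog-cat example: 2000 images, 75/25 split.
total_images = 2000
train_images = int(total_images * 0.75)      # 1500
valid_images = total_images - train_images   # 500

train_batch_size = 50
valid_batch_size = 50

# One pass over each subset per epoch.
steps_per_epoch = train_images // train_batch_size    # 30
validation_steps = valid_images // valid_batch_size   # 10
```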

If you just want to split up the training data, you can use train_test_split from sklearn; the documentation is here. The code below shows how to separate the data into a training set, a validation set, and a test set. Assume you want 80% of the data for training, 10% for validation, and 10% for test, and that X is a NumPy array of images and y is the associated array of labels. The code below shows the split:

from sklearn.model_selection import train_test_split
X_train, X_tv, y_train, y_tv = train_test_split(X, y, train_size=0.8, random_state=42)
X_test, X_valid, y_test, y_valid = train_test_split(X_tv, y_tv, train_size=0.5, random_state=20)
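As a quick check, the same two-stage split on a synthetic array gives the expected 80/10/10 sizes (the 8x8x3 shapes here are just placeholders for real image arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 "images" of shape 8x8x3 with integer labels.
X = np.random.rand(100, 8, 8, 3)
y = np.random.randint(0, 2, size=100)

# First split off 80% for training; the remaining 20% is then split
# in half to give 10% validation and 10% test.
X_train, X_tv, y_train, y_tv = train_test_split(X, y, train_size=0.8, random_state=42)
X_test, X_valid, y_test, y_valid = train_test_split(X_tv, y_tv, train_size=0.5, random_state=20)

print(len(X_train), len(X_valid), len(X_test))  # 80 10 10
```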

