
Splitting a tensorflow dataset into training, test, and validation sets from keras.preprocessing API

I'm new to tensorflow/keras and I have a file structure with 3000 folders, each containing 200 images, to be loaded in as data. I know that keras.preprocessing.image_dataset_from_directory lets me load the data and split it into training/validation sets as below:

val_data = tf.keras.preprocessing.image_dataset_from_directory(
    'etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=0.3,
    subset="validation",
    seed=1,
    color_mode='grayscale',
    shuffle=True)

Found 607200 files belonging to 3036 classes. Using 182160 files for validation.

But I'm not sure how to further split my validation set into a test split while maintaining proper classes. From what I can tell (through the GitHub source code), the take method simply takes the first x elements of the dataset, and skip does the same. I'm unsure whether this maintains stratification of the data, and I'm not quite sure how to return labels from the dataset to test it.

Any help would be appreciated.

You almost got the answer. The key is to use .take() and .skip() to further split the validation set into two datasets -- one for validation and the other for test. Using your example, you need to execute the following lines of code. Let's assume you need 70% for the training set, 10% for the validation set, and 20% for the test set. For the sake of completeness, I am also including the step that generates the training set. Let's also assign a few basic variables that must be the same when first splitting the entire dataset into training and validation sets.

seed_train_validation = 1 # Must be same for train_ds and val_ds
shuffle_value = True
validation_split = 0.3

train_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="training",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)

val_ds = tf.keras.utils.image_dataset_from_directory(
    directory='etlcdb/ETL9G_IMG/',
    image_size=(128, 127),
    validation_split=validation_split,
    subset="validation",
    seed=seed_train_validation,
    color_mode='grayscale',
    shuffle=shuffle_value)

Next, determine how many batches of data are available in the validation set using tf.data.experimental.cardinality, and then move two-thirds of them (2/3 of 30% = 20%) to a test set as follows. Note that the default value of batch_size is 32 (see the documentation).

val_batches = tf.data.experimental.cardinality(val_ds)
test_ds = val_ds.take((2*val_batches) // 3)
val_ds = val_ds.skip((2*val_batches) // 3)

All three datasets (train_ds, val_ds, and test_ds) yield batches of images together with labels inferred from the directory structure. So, you are good to go from here.
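To answer the stratification doubt in the question, you can pull the labels back out of the batched datasets and count them per class. Here is a small self-contained sketch with dummy tensors standing in for the images (the shapes and class count are made up for illustration); the same class_counts helper works on the real datasets above:

```python
import numpy as np
import tensorflow as tf

# Dummy stand-in for val_ds: 12 samples, 3 classes, batched like the real dataset.
images = tf.zeros([12, 4, 4, 1])
labels = tf.constant([0, 1, 2] * 4)
val_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(2)

# Same take/skip split as in the answer above.
val_batches = tf.data.experimental.cardinality(val_ds)
test_ds = val_ds.take((2 * val_batches) // 3)
val_ds_rest = val_ds.skip((2 * val_batches) // 3)

def class_counts(ds, num_classes=3):
    # Concatenate the label component of every batch and count per class.
    all_labels = np.concatenate([y.numpy() for _, y in ds])
    return np.bincount(all_labels, minlength=num_classes)

print(class_counts(test_ds))      # per-class counts in the first 2/3 of batches
print(class_counts(val_ds_rest))  # per-class counts in the remaining batches
```

Note that take/skip operate on whole batches in order, so the split is not stratified by itself; the earlier shuffle (with a fixed seed) is what keeps the class mix roughly even, and this check lets you verify it.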

I could not find supporting documentation, but I believe image_dataset_from_directory takes the end portion of the dataset as the validation split. shuffle is now set to True by default, so the dataset is shuffled before training to avoid using only some classes for the validation split. The split done by image_dataset_from_directory only relates to the training process. If you need a (highly recommended) test split, you should split your data beforehand into training and testing. Then, image_dataset_from_directory will split your training data into training and validation.

I usually take a smaller percentage (10%) for the in-training validation, and split the original dataset 80% training, 20% testing. With these values, the final splits (as fractions of the initial dataset size) are:

  • 80% training:
    • 72% training (used to adjust the weights in the network)
    • 8% in-training validation (used only to check the metrics of the model after each epoch)
  • 20% testing (never seen by the training process at all)
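The arithmetic behind those numbers can be sketched as follows, assuming the validation_split passed to image_dataset_from_directory applies only to the 80% training portion:

```python
test_fraction = 0.20
train_portion = 1.0 - test_fraction   # 0.80 of the data fed to image_dataset_from_directory
validation_split = 0.10               # fraction of *that* portion held out for in-training validation

in_training_validation = train_portion * validation_split    # fraction of the original dataset
effective_training = train_portion * (1 - validation_split)  # fraction of the original dataset

print(effective_training, in_training_validation, test_fraction)
```

So a validation_split of 0.10 on the remaining 80% yields the 72% / 8% / 20% breakdown listed above.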

There is additional information on how to split the data in your directories in this question: Keras split train test set when using ImageDataGenerator
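A minimal sketch of the beforehand split suggested above, assuming a class-per-folder layout like the one in the question; the helper name and paths are hypothetical. Splitting file lists per class keeps the class proportions roughly equal in both output directories:

```python
import random
import shutil
from pathlib import Path

def split_directory(src, dst, test_fraction=0.2, seed=1):
    """Copy a class-per-folder dataset into dst/train/<class> and dst/test/<class>."""
    random.seed(seed)
    for class_dir in Path(src).iterdir():
        if not class_dir.is_dir():
            continue
        files = sorted(class_dir.iterdir())
        random.shuffle(files)
        n_test = int(len(files) * test_fraction)
        # First n_test shuffled files go to test, the rest to train.
        for split, subset in (("test", files[:n_test]), ("train", files[n_test:])):
            out = Path(dst) / split / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in subset:
                shutil.copy2(f, out / f.name)
```

You would then point image_dataset_from_directory at dst/train with a validation_split, and build the test dataset from dst/test with no split at all.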

For splitting into training and validation sets, you can do something like this.

The main point is to keep the same seed.

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    label_mode='categorical',
    validation_split=0.2,
    subset="training",
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    validation_split=0.2,
    subset="validation",
    label_mode='categorical',
    seed=1337,
    color_mode="grayscale",
    image_size=image_size,
    batch_size=batch_size,
)

This is taken from: https://keras.io/examples/vision/image_classification_from_scratch/
