
keras: issue using ImageDataGenerator and KFold for fit_generator

flow_from_directory(directory): this takes in a directory, but does not accept pre-split training images.

sklearn.model_selection.KFold: provides the split indices of the images. Those could be used in fit(), but not in fit_generator().
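To make the mismatch concrete, here is a minimal sketch (plain NumPy arrays with hypothetical shapes, not from the question) of how KFold yields index arrays — usable for slicing in-memory data passed to fit(), but with no counterpart parameter in flow_from_directory():

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 in-memory samples, 2 features each
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # these index arrays can slice arrays for model.fit(X[train_idx], ...),
    # but flow_from_directory() has no parameter that accepts them
    print(fold, len(train_idx), len(val_idx))
```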

How can anyone use KFold along with ImageDataGenerator? Is there a way?

At the moment one cannot split a dataset held in a folder using a flow_from_directory generator. This option is simply not implemented. To get the test/train split, one needs to split the main directory into a set of train/test/val directories, using e.g. the os library in Python.
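A minimal sketch of such a directory split, using only the standard os, shutil and random libraries (the split_into_dirs helper and its val_frac parameter are my own illustrative names, not from the answer):

```python
import os
import random
import shutil

def split_into_dirs(src_dir, dst_root, val_frac=0.2, seed=42):
    # src_dir holds one subfolder per class; copy its files into
    # dst_root/train/<class> and dst_root/val/<class>
    rng = random.Random(seed)
    for cls in sorted(os.listdir(src_dir)):
        files = sorted(os.listdir(os.path.join(src_dir, cls)))
        rng.shuffle(files)
        n_val = int(len(files) * val_frac)
        for subset, names in (("val", files[:n_val]), ("train", files[n_val:])):
            out = os.path.join(dst_root, subset, cls)
            os.makedirs(out, exist_ok=True)
            for name in names:
                shutil.copy(os.path.join(src_dir, cls, name),
                            os.path.join(out, name))
```

The resulting train and val directories can then each be handed to flow_from_directory.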

Assuming that you have a classification problem with 2 classes, I would do something like:

from keras.preprocessing.image import ImageDataGenerator
from keras.utils import to_categorical

# one-hot encode the labels for the 2-class problem
train_y = to_categorical(train_y, num_classes=2)
test_y = to_categorical(test_y, num_classes=2)

aug = ImageDataGenerator(...)  # your ImageDataGenerator

history = model.fit_generator(aug.flow(train_x, train_y, batch_size=32),
                              validation_data=(test_x, test_y))

To anyone who bumped into this problem: as of the date this answer was posted, there is no (at least relatively) simple out-of-the-box solution, in my opinion and judging by the results of my own searches.

The only solution I came up with, resolving a similar problem in my project, was to partition my dataset, with the number of partitions equal to the number of folds, and to save them as a dictionary with the partition number as key and the list of file paths as value. After that, you still have to sort your files into class folders for the train and validation subsets respectively.

For example, let K=10. The algorithm can be described like this:

  • Divide your dataset into 10 equally-sized partitions.
  • Take one partition as the validation subset. Sort it by class into the required folders.
  • The rest of the partitions should be treated as the training subset and sorted into the required folders.
  • Create data generators for the val and train subsets.
  • Train your model and save it, along with your architecture.
  • Repeat the steps above for every other partition (take one partition as val, train on the others), but now you have to load your model from the save file.
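The partition bookkeeping from the steps above can be sketched like this (make_partitions and fold_split are hypothetical helper names of my own; the Keras training itself is only outlined in comments, since the answer's full script is not shown):

```python
def make_partitions(file_paths, k=10):
    # split the file list into k roughly equal partitions,
    # keyed by partition number
    return {i: file_paths[i::k] for i in range(k)}

def fold_split(partitions, val_fold):
    # one partition becomes the validation subset;
    # the rest are concatenated into the training subset
    val_paths = partitions[val_fold]
    train_paths = [p for i, part in partitions.items()
                   if i != val_fold for p in part]
    return train_paths, val_paths

# per-fold loop (Keras parts outlined only):
# for fold in range(10):
#     train_paths, val_paths = fold_split(partitions, fold)
#     ... sort files into train/<class> and val/<class> folders ...
#     ... create a flow_from_directory generator for each subset ...
#     ... load the model from the save file (after the first fold),
#         train, then save it again ...
```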

I'm afraid the code snippet for this solution (including the sorting script and the partition-dictionary script) is too large to include here, but I'll gladly share it if necessary.
