Handle invalid/corrupted image files in ImageDataGenerator.flow_from_directory in Keras
I am using Python with Keras, running ImageDataGenerator and using flow_from_directory. I have some problematic image files, so can I use the data generator to handle the read errors? I am getting a "not valid jpg file" error on a small portion of the images and would like to handle this without my code crashing.
Well, one solution is to modify the ImageDataGenerator code and put an error-handling mechanism (i.e. try/except) in it. However, an alternative is to wrap your generator inside another generator and use try/except there. The disadvantage of this solution is that it throws away the whole generated batch even if only a single image in that batch is corrupted (which means some of the samples may never be used for training at all):
data_gen = ImageDataGenerator(...)
train_gen = data_gen.flow_from_directory(...)

def my_gen(gen):
    while True:
        try:
            data, labels = next(gen)
            yield data, labels
        except OSError:
            # PIL raises OSError (or a subclass) for invalid/truncated
            # image files; skip the whole batch and fetch the next one
            pass

# ... define your model and compile it

# fit the model
model.fit_generator(my_gen(train_gen), ...)
Another disadvantage of this solution is that since you need to specify the number of steps of the generator (i.e. steps_per_epoch), and considering that a batch may be thrown away in a step and a new batch fetched instead in the same step, you may end up training on some of the samples more than once in an epoch. This may or may not have significant effects depending on how many batches include corrupted images (i.e. if there are only a few, there is nothing to worry about that much).
Finally, note that you may want to use the newer Keras data generator, i.e. the Sequence class, to read images one by one in the __getitem__ method of each batch and discard the corrupted ones. However, the problem of the previous approach, i.e. training on some of the images more than once, is still present here as well, since you also need to implement the __len__ method, which is essentially equivalent to the steps_per_epoch argument. Still, in my opinion this approach (i.e. subclassing the Sequence class) is superior to the one above (of course, if you put aside the fact that you may need to write more code), and it has fewer side effects, since you can discard a single image and not the whole batch.