How can I explore and modify the created dataset from tf.keras.preprocessing.image_dataset_from_directory()?
Here's how I used the function:
dataset = tf.keras.preprocessing.image_dataset_from_directory(
main_directory,
labels='inferred',
image_size=(299, 299),
validation_split=0.1,
subset='training',
seed=123
)
I'd like to explore the created dataset much like in this example, particularly the part where it was converted to a pandas dataframe. But my minimum goal is to check the labels and the number of files attached to each, just to verify that it created the dataset as expected (each sub-directory being the label of the images inside it).
To be clear, the main_directory is set up like this:
main_directory
- class_a
- 000.jpg
- ...
- class_b
- 100.jpg
- ...
And I'd like to see the dataset display its info with something like this:
label number of images
class_a 100
class_b 100
Additionally, is it possible to remove labels and their corresponding images from a dataset? The idea is to drop them if the number of images for a label is below a certain threshold, or by some other metric. It can of course be done outside this function through other means, but I'd like to know if it is indeed possible, and if so, how.
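Done outside the function, the drop-small-classes idea amounts to filtering the file list by per-class counts. A sketch under that assumption (the helper name and default threshold are mine):

```python
from collections import Counter
from pathlib import Path

def drop_small_classes(file_paths, min_images=50):
    """Keep only files whose class (parent directory name) appears
    at least min_images times in the list."""
    counts = Counter(Path(p).parent.name for p in file_paths)
    return [p for p in file_paths
            if counts[Path(p).parent.name] >= min_images]
```

The filtered list can then be fed to whatever loader builds the dataset.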
EDIT: For additional context, the end goal of all of this is to train a pre-trained model like this one with local images divided into folders named after their classes. If there is a better way that meets this end goal without using that function, it's welcome all the same. Thanks!
I think it would be much easier to use glob2 to get all your filenames, process them as you want to, then make a simple loading function to replace image_dataset_from_directory.
Get all your files:
import glob2

files = glob2.glob('class_*\\*.jpg')  # Windows-style separator; use 'class_*/*.jpg' on POSIX
Then manipulate this list of filenames as desired.
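For instance, the validation_split=0.1 and seed=123 arguments from the question can be reproduced on the filename list with a seeded shuffle and slice (the function name is mine; 0.1 is just the fraction used in the question):

```python
import random

def split_files(files, val_fraction=0.1, seed=123):
    """Shuffle a copy of the file list with a fixed seed, then split it
    into (train, validation) lists."""
    files = sorted(files)               # deterministic starting order
    random.Random(seed).shuffle(files)
    n_val = int(len(files) * val_fraction)
    return files[n_val:], files[:n_val]
```

Because the seed fixes the shuffle, repeated runs produce the same split, mirroring what the seed argument does in image_dataset_from_directory.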
Then, make a function to load the images:
import os
import tensorflow as tf

def load(file_path):
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)  # scale to [0, 1]
    img = tf.image.resize(img, size=(299, 299))
    # The first path component is the class directory;
    # this yields 1 for class_a and 0 otherwise (binary labels)
    label = tf.strings.split(file_path, os.sep)[0]
    label = tf.cast(tf.equal(label, 'class_a'), tf.int32)
    return img, label
Then create your dataset for training:
train_ds = tf.data.Dataset.from_tensor_slices(files).map(load).batch(4)
Then train:
model.fit(train_ds)