How should I change the way I load my dataset so that I can use Kaggle's TPUs?

Question

I am using the Keras API of TensorFlow to train a model that detects which characters of the Kannada script appear in an image. Kannada is a South Indian language, and classification can involve upwards of 657 classes because its characters are combinations of consonants and vowels. For more background, see this Wikipedia article.

The dataset for this model is a single directory containing multiple subdirectories, one per class, as shown here: directory structure

Alternatively, the Kaggle public link here shows the structure more clearly.

These are my imports:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten, BatchNormalization, Conv2D, MaxPool2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator

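I load the data with ImageDataGenerator because it lets me easily split the dataset into separate training and validation sets. Below is the code I use to build the two sets: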

# Creating training and validation data generators
datagen = ImageDataGenerator(validation_split=0.01)

train_generator = datagen.flow_from_directory(
    directory="../input/kannada-images-with-noise/Images_with_noise",
    subset="training",
    batch_size=256,
    shuffle=True,
    classes=image_classes,
    color_mode='grayscale',
    target_size=(75, 75))

valid_generator = datagen.flow_from_directory(
    directory="../input/kannada-images-with-noise/Images_with_noise",
    subset="validation",
    batch_size=256,
    shuffle=True,
    classes=image_classes,
    color_mode='grayscale',
    target_size=(75, 75))

# Creating step sizes
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=valid_generator.n//valid_generator.batch_size
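A side note on the step sizes above: floor division drops the last partial batch, and it evaluates to 0 whenever a split holds fewer samples than one batch. With validation_split=0.01 and batch_size=256 that can easily happen for the validation set. The sketch below only illustrates the arithmetic; the sample counts are hypothetical, not taken from the actual dataset.

```python
import math

BATCH_SIZE = 256  # batch size used in the question


def steps(num_samples, batch_size=BATCH_SIZE):
    """Return (floor, ceil) step counts for one pass over num_samples."""
    floor_steps = num_samples // batch_size           # drops the partial batch
    ceil_steps = math.ceil(num_samples / batch_size)  # keeps the partial batch
    return floor_steps, ceil_steps


# Hypothetical sizes: 20,000 images split 99% / 1%.
print(steps(19_800))  # training split
print(steps(200))     # validation split: the floor count is 0
```

If the floor count for the validation split comes out as 0, either a larger validation fraction or a smaller validation batch size would avoid an empty validation pass.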

I then pass these generators to the model.fit() function like this:

# Training our model
model.fit(
    x=train_generator, 
    steps_per_epoch=STEP_SIZE_TRAIN, 
    validation_data=valid_generator,
    validation_steps=STEP_SIZE_VALID,
    epochs=25,
    verbose=1
)

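So far I have stuck with this approach because it is simple and straightforward. However, if I want to use the TPUs available on Kaggle, I will have to change how I load the data and use tf.data.Dataset, because ImageDataGenerator cannot fetch data through the Google Cloud Storage (GCS) paths of Kaggle datasets.

How do I load my data with tf.data.Dataset? I would appreciate links to any examples or tutorials I can follow. If restructuring the directory would work better, please tell me how I should do it.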

Answer 1

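I know of two approaches.

Note that the tpu argument passed to TPUClusterResolver here (an empty string) is a special address just for Colab. If you are running on Google Compute Engine (GCE), you should instead pass in the name of your Cloud TPU.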

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

INFO:tensorflow:Initializing the TPU system: grpc://10.240.1.74:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type
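Since the same notebook is often run both with and without an accelerator, a common pattern is to try to connect to a TPU and fall back to the default strategy otherwise. The following is a minimal sketch of that idea, not part of the original answer. The helper name detect_strategy is hypothetical, and it assumes that calling TPUClusterResolver() with no arguments auto-detects the TPU and raises ValueError when none is configured; the exact exception may differ between TensorFlow versions.

```python
import tensorflow as tf


def detect_strategy():
    """Return a TPUStrategy if a TPU is found, else the default strategy."""
    try:
        # No argument: let TensorFlow look for a TPU in the environment.
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    except ValueError:
        # Assumption: ValueError signals that no TPU is configured.
        return tf.distribute.get_strategy()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)


strategy = detect_strategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)
```

On a machine without a TPU this should report a single replica, so the rest of the code can stay the same in both settings.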

Manual device placement

After the TPU is initialized, you can use manual device placement to place the computation on a single TPU device.

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
with tf.device('/TPU:0'):
  c = tf.matmul(a, b)
print("c device: ", c.device)
print(c)

c device:  /job:worker/replica:0/task:0/device:TPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

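Most of the time, users want to run the model on multiple TPUs in a data-parallel way. A distribution strategy is an abstraction that can be used to drive models on CPUs, GPUs, or TPUs. Simply swap out the distribution strategy and the model will run on the given device.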

strategy = tf.distribute.TPUStrategy(resolver)

INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
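For a Keras model, "swapping the strategy" in practice means creating and compiling the model inside strategy.scope(), so that its variables are placed by the strategy. The sketch below assumes a deliberately tiny, hypothetical architecture (the asker's real model is not shown), uses 657 as a placeholder class count taken from the question's "upwards of 657", and uses the default strategy as a stand-in so it runs without a TPU. The per-replica batch size of 32 is likewise only an illustrative choice.

```python
import tensorflow as tf

NUM_CLASSES = 657  # placeholder from the question; use the real class count

# Stand-in so the sketch runs anywhere; on a TPU use the TPUStrategy instead.
strategy = tf.distribute.get_strategy()

with strategy.scope():
    # Variables created inside the scope are managed by the strategy.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(75, 75, 1)),  # 75x75 grayscale, as in the question
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPool2D(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )

# Each global batch is split across replicas, so scale it with their number.
GLOBAL_BATCH_SIZE = 32 * strategy.num_replicas_in_sync
print(model.output_shape, GLOBAL_BATCH_SIZE)
```

Only model construction and compilation need to sit inside the scope; the call to model.fit() can stay outside it.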

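To replicate a computation so that it can run on all TPU cores, pass it to the strategy.run API. In the example below, every core receives the same inputs (a, b) and performs the matmul independently. The output holds the values from all replicas.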

@tf.function
def matmul_fn(x, y):
  z = tf.matmul(x, y)
  return z

z = strategy.run(matmul_fn, args=(a, b))
print(z)

PerReplica:{
  0: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  1: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  2: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  3: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  4: tf.Tensor(
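To tie this back to the question, here is one possible way to replace ImageDataGenerator with a tf.data.Dataset for a one-subdirectory-per-class layout. It is a sketch under several assumptions, not a confirmed solution: the helper names list_and_split and make_dataset are hypothetical, the split is a plain random split (not necessarily how ImageDataGenerator partitions files), and paths are assumed to use "/" as separator. On Kaggle, the root would typically be a GCS path obtained from the kaggle_datasets helper rather than ../input.

```python
import random

import tensorflow as tf

IMG_SIZE = (75, 75)  # target_size used in the question
AUTOTUNE = tf.data.AUTOTUNE


def list_and_split(root, validation_fraction=0.01, seed=0):
    """List <root>/<class>/<file> paths and split them at random."""
    files = sorted(tf.io.gfile.glob(root + "/*/*"))  # local or gs:// paths
    random.Random(seed).shuffle(files)
    n_valid = int(len(files) * validation_fraction)
    return files[n_valid:], files[:n_valid]


def make_dataset(files, class_names, batch_size, training):
    """Build a batched dataset of (image, one-hot label) pairs."""
    names = tf.constant(class_names)

    def load(path):
        # The parent directory name is the class label.
        label_name = tf.strings.split(path, "/")[-2]
        label = tf.argmax(tf.cast(tf.equal(names, label_name), tf.int32))
        image = tf.io.decode_image(
            tf.io.read_file(path), channels=1, expand_animations=False)
        image = tf.image.resize(image, IMG_SIZE)  # float32, shape (75, 75, 1)
        return image, tf.one_hot(label, len(class_names))

    ds = tf.data.Dataset.from_tensor_slices(files)
    if training:
        ds = ds.shuffle(len(files))  # reshuffled on every epoch
    ds = ds.map(load, num_parallel_calls=AUTOTUNE)
    # TPUs need static shapes, so the last partial batch is dropped.
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(AUTOTUNE)
```

The resulting datasets can be passed straight to model.fit(train_ds, validation_data=valid_ds, epochs=25); because they are finite, steps_per_epoch and validation_steps can be left out. Note that with drop_remainder=True the validation files must fill at least one batch, otherwise the validation dataset is empty.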

Answer 1
0 2021-05-09 07:45:02