
How should I change the way I load my dataset so I can take advantage of Kaggle's TPU?

I am using the Keras API of TensorFlow to train a model that detects which characters of the Kannada script appear in an image. Kannada is a South Indian language that can have upwards of 657 classes for classification, because its characters are combinations of consonants and vowels. For more background, see the Wikipedia article.

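The dataset for this model is a single directory containing multiple subdirectories, one per class, as shown here: directory structure

Alternatively, the Kaggle public link here shows the structure more clearly.

Here are the imports I use: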

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Flatten, BatchNormalization, Conv2D, MaxPool2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator

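I load the data with ImageDataGenerator because it lets me split the dataset into separate training and validation sets easily. Below is the code I use to construct the two sets: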

# Creating training and validation data generators
# image_classes (defined elsewhere) lists the class subdirectory names
datagen = ImageDataGenerator(validation_split=0.01)

train_generator = datagen.flow_from_directory(
    directory="../input/kannada-images-with-noise/Images_with_noise",
    subset="training",
    batch_size=256,
    shuffle=True,
    classes=image_classes,
    color_mode='grayscale',
    target_size=(75, 75))

valid_generator = datagen.flow_from_directory(
    directory="../input/kannada-images-with-noise/Images_with_noise",
    subset="validation",
    batch_size=256,
    shuffle=True,
    classes=image_classes,
    color_mode='grayscale',
    target_size=(75, 75))

# Creating step sizes
STEP_SIZE_TRAIN = train_generator.n // train_generator.batch_size
STEP_SIZE_VALID = valid_generator.n // valid_generator.batch_size
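A note on the step sizes above: floor division gives 0 whenever a subset holds fewer images than one batch, which could happen for a 1% validation split with a batch size of 256. The sketch below only illustrates that arithmetic. The helper name steps_per_epoch and the sample counts are made up for illustration and do not come from the dataset.

```python
import math


def steps_per_epoch(num_samples, batch_size, drop_remainder=False):
    """Number of batches needed to cover num_samples examples once.

    With drop_remainder=True the final partial batch is discarded
    (floor division); otherwise it counts as one extra step (ceiling).
    """
    if drop_remainder:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# Hypothetical subset sizes, chosen only to show the difference.
print(steps_per_epoch(200, 256, drop_remainder=True))   # floor division
print(steps_per_epoch(200, 256, drop_remainder=False))  # ceiling
```

If the validation subset turns out to be smaller than the batch size, either a smaller validation batch size or the ceiling variant avoids a step count of zero.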

I then pass these generators to the model.fit() function like this:

# Training our model
model.fit(
    x=train_generator, 
    steps_per_epoch=STEP_SIZE_TRAIN, 
    validation_data=valid_generator,
    validation_steps=STEP_SIZE_VALID,
    epochs=25,
    verbose=1
)

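So far I have stuck with this approach because it is simple and straightforward. However, if I want to use the TPUs available on Kaggle, I will have to change the way I load the data and use tf.data.Dataset, because ImageDataGenerator cannot fetch data through the Google Cloud Storage (GCS) link of a Kaggle dataset.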

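How can I load my data using tf.data.Dataset? I would appreciate links to any examples or tutorials I could follow. If restructuring the directory would work better, please tell me how I should do it.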
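One possible shape for such a pipeline, as a minimal sketch rather than a definitive answer: list the files under each class subdirectory, hold out a small fraction for validation, and decode the images inside a tf.data.Dataset. It assumes TensorFlow 2.4 or later (for tf.data.AUTOTUNE) and image files that tf.io.decode_image can read. The helper names list_images, split_train_valid, load_example and make_dataset are hypothetical, and the random split is a stand-in for, not a reproduction of, ImageDataGenerator's validation_split.

```python
import random

import tensorflow as tf

IMG_SIZE = (75, 75)  # same as target_size in the question


def list_images(root, class_names):
    """Collect file paths and integer labels from root/<class_name>/<image>."""
    paths, labels = [], []
    for index, name in enumerate(class_names):
        # tf.io.gfile.glob accepts local paths as well as gs:// paths.
        files = sorted(tf.io.gfile.glob(f"{root}/{name}/*"))
        paths.extend(files)
        labels.extend([index] * len(files))
    return paths, labels


def split_train_valid(paths, labels, valid_fraction=0.01, seed=0):
    """Shuffle once, then hold out a fraction of the examples for validation."""
    pairs = list(zip(paths, labels))
    random.Random(seed).shuffle(pairs)
    num_valid = max(1, int(len(pairs) * valid_fraction))
    return pairs[num_valid:], pairs[:num_valid]


def load_example(path, label, num_classes):
    """Decode one image to a 75x75x1 float tensor and one-hot encode its label."""
    image = tf.io.decode_image(
        tf.io.read_file(path), channels=1, expand_animations=False)
    image = tf.image.resize(image, IMG_SIZE)
    return image, tf.one_hot(label, num_classes)


def make_dataset(pairs, num_classes, batch_size, training):
    """Turn (path, label) pairs into a batched, prefetched tf.data.Dataset."""
    paths = [path for path, _ in pairs]
    labels = [label for _, label in pairs]
    ds = tf.data.Dataset.from_tensor_slices((paths, labels))
    if training:
        ds = ds.shuffle(len(paths))
    ds = ds.map(lambda path, label: load_example(path, label, num_classes),
                num_parallel_calls=tf.data.AUTOTUNE)
    # Dropping the last partial batch keeps batch shapes static for training.
    ds = ds.batch(batch_size, drop_remainder=training)
    return ds.prefetch(tf.data.AUTOTUNE)
```

On Kaggle, root would typically be a gs:// path, for example one returned by KaggleDatasets().get_gcs_path() from the kaggle_datasets package; that call is only available inside Kaggle notebooks and is not exercised here. The resulting datasets can be passed to model.fit() in place of the generators.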

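I know of two approaches. Note that the tpu argument to TPUClusterResolver is a special address just for Colab. If you are running on Google Compute Engine (GCE), you should instead pass in the name of your Cloud TPU.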

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))


INFO:tensorflow:Initializing the TPU system: grpc://10.240.1.74:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type

Manual device placement

After the TPU is initialized, you can use manual device placement to place the computation on a single TPU device.

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
with tf.device('/TPU:0'):
  c = tf.matmul(a, b)
print("c device: ", c.device)
print(c)
c device:  /job:worker/replica:0/task:0/device:TPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

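Most of the time, users want to run the model on multiple TPUs in a data-parallel way. A distribution strategy is an abstraction that can be used to drive models on CPUs, GPUs, or TPUs. Simply swap out the distribution strategy and the model will run on the given device.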
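For a Keras model like the one in the question, swapping the strategy usually means creating and compiling the model inside strategy.scope(). The following is a minimal sketch under stated assumptions: the small CNN is a placeholder architecture rather than the asker's model, the helper names build_model, build_under_strategy and global_batch_size are hypothetical, and the strategy object is passed in so the same code runs with the default strategy on CPU or GPU.

```python
import tensorflow as tf

NUM_CLASSES = 657  # class count mentioned in the question


def build_model(num_classes=NUM_CLASSES):
    """Small illustrative CNN for 75x75 grayscale inputs (placeholder layers)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(75, 75, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPool2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def build_under_strategy(strategy, num_classes=NUM_CLASSES):
    """Create and compile the model so its variables belong to the strategy."""
    with strategy.scope():
        return build_model(num_classes)


def global_batch_size(strategy, per_replica_batch_size):
    """Batch size to request from the input pipeline for one training step."""
    return per_replica_batch_size * strategy.num_replicas_in_sync


# The default strategy stands in for TPUStrategy when no TPU is attached.
strategy = tf.distribute.get_strategy()
model = build_under_strategy(strategy, num_classes=5)
print(global_batch_size(strategy, 16))
```

With a TPUStrategy that reports 8 cores, global_batch_size(strategy, 16) would evaluate to 128, since each step's batch is divided across the replicas.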

strategy = tf.distribute.TPUStrategy(resolver)
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)

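To replicate a computation so that it runs on all TPU cores, pass it to the strategy.run API. In the example below, every core receives the same inputs (a, b) and performs the matmul independently. The output contains the values from all replicas.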

@tf.function
def matmul_fn(x, y):
  z = tf.matmul(x, y)
  return z

z = strategy.run(matmul_fn, args=(a, b))
print(z)
PerReplica:{
  0: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  1: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  2: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  3: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  4: tf.Tensor(
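The per-replica container printed above can be unpacked into ordinary tensors with strategy.experimental_local_results. The sketch below repeats the matmul example so that it is self-contained and uses the default strategy as a stand-in, which is an assumption made so the code also runs without a TPU. The helper name run_on_all_replicas is hypothetical.

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])


@tf.function
def matmul_fn(x, y):
    return tf.matmul(x, y)


def run_on_all_replicas(strategy):
    """Run matmul_fn on every replica and return a tuple of plain tensors."""
    per_replica = strategy.run(matmul_fn, args=(a, b))
    # One entry per local replica: 8 on the TPU shown above, 1 by default.
    return strategy.experimental_local_results(per_replica)


# The default strategy has a single replica; a TPUStrategy can be passed instead.
results = run_on_all_replicas(tf.distribute.get_strategy())
for index, value in enumerate(results):
    print(index, value.numpy().tolist())
```

Each element of the returned tuple holds the same 2x2 product, because every replica was given identical inputs.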

