Tensorflow splitting dataset into training and testing causes bottleneck/slow

I have a dataset, and when I do preprocessing on it with ds = ds.map(process_path, num_parallel_calls=AUTOTUNE).prefetch(AUTOTUNE), that line executes very fast. When I then try to access one of the dataset's elements with:

import tensorflow as tf
import matplotlib.pyplot as plt

for image, label in ds.take(1):
    print(image.shape)
    image = tf.squeeze(image)        # drop the channel dimension for plotting
    plt.imshow(image, cmap='gray')

It takes a second or two to load; that's my first question:

Does the preprocessing only get run on the dataset when an element from the dataset is accessed, and not immediately when I call ds.map(process_path, ...)?

However, my main issue is that when I split the dataset ds into two, training and testing, and try to access one of the elements again, it is considerably slower... something like 20x slower. I split it into two with:

test_ds_size = int(image_count * 0.2)   # image_count is the total number of elements in ds
train_ds = ds.skip(test_ds_size)
test_ds = ds.take(test_ds_size)

I then try to access it in the same way as above, but with ds replaced by train_ds; my second question is:

Why is this considerably slower, just from splitting it into two?

Or am I doing something very wrong...

Any help is greatly appreciated.

dataset.map creates a new dataset by applying the map function; it does not process any elements at that point. Likewise, even inside the loop, dataset.take() only creates a new dataset restricted to the specified number of elements, which takes very little time. The real work happens once the dataset is actually iterated; anything you do after the elements are loaded (such as plotting) is not related to tf.data performance. You can check this with the example below.

import tensorflow as tf
from time import time

dataset = tf.data.Dataset.range(1, 100)
t1 = time()
dataset = dataset.map(lambda x: x + 1)
t2 = time()
print("Time taken for map : ", t2-t1)

t3 = time()
ds = dataset.take(50)
t4 = time()                       # t4 - t3 measures only the take() call
list(ds.as_numpy_iterator())      # the elements are produced here, outside the timed window
print("Time taken for take() : ", t4-t3)

Time taken for map :  0.013489961624145508
Time taken for take() :  0.0005645751953125

Both numbers are small because neither call produced any elements. Now, let's see the time taken by take() when the iteration over the elements is included in the measurement:

dataset = tf.data.Dataset.range(1, 100)
t1 = time()
dataset = dataset.map(lambda x: x + 1)
t2 = time()
print("Time taken for map : ", t2-t1)

t3 = time()
ds = dataset.take(50)
list(ds.as_numpy_iterator())      # this time the iteration happens inside the timed window
t4 = time()
print("Time taken for take() after some operation : ", t4-t3)

Time taken for map :  0.00974416732788086
Time taken for take() after some operation :  0.017722606658935547 

Coming to the split: creating train and test data from the existing dataset can be done in the way you have specified, but it will be slow, because every time the split datasets are consumed, skip() has to iterate through (and re-run the preprocessing on) all the elements it discards before producing the first one you actually use.
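One way to avoid paying that cost repeatedly (not part of the original answer) is to cache the preprocessed elements, so the map function only runs once per element. A minimal sketch, with a hypothetical process_path standing in for whatever preprocessing the question uses:

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Hypothetical stand-in for the question's process_path function.
def process_path(x):
    return tf.cast(x, tf.float32) * 2.0, x

ds = tf.data.Dataset.range(1000)
ds = ds.map(process_path, num_parallel_calls=AUTOTUNE)
ds = ds.cache()   # after the first full pass, elements are served from memory,
                  # so skip() no longer re-runs the preprocessing

test_ds_size = 200
train_ds = ds.skip(test_ds_size).prefetch(AUTOTUNE)
test_ds = ds.take(test_ds_size).prefetch(AUTOTUNE)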

The ideal way to create a tf.data.Dataset for train and test is to create each split separately, as shown here. Make sure you shuffle the data beforehand so that both the train and test sets get a proper distribution of your dataset.
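A minimal sketch of that approach, reusing the question's 80/20 split (the image_count value, shuffle seed, and process_path body below are illustrative assumptions, not from the answer):

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
image_count = 1000                          # assumed total number of elements
ds = tf.data.Dataset.range(image_count)     # stand-in for the real file dataset

# Shuffle once with a fixed seed and reshuffle_each_iteration=False, so that
# take() and skip() see the same order and the two splits never overlap.
ds = ds.shuffle(image_count, seed=42, reshuffle_each_iteration=False)

test_ds_size = int(image_count * 0.2)
test_ds = ds.take(test_ds_size)
train_ds = ds.skip(test_ds_size)

# Hypothetical preprocessing, applied to each split separately, then cached.
def process_path(x):
    return tf.cast(x, tf.float32), x

train_ds = train_ds.map(process_path, num_parallel_calls=AUTOTUNE).cache().prefetch(AUTOTUNE)
test_ds = test_ds.map(process_path, num_parallel_calls=AUTOTUNE).cache().prefetch(AUTOTUNE)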
