
Why does train_test_split take a long time to run?

I'm using Google Colab, and I'm trying to train a convolutional neural network. To split a dataset of roughly 11,500 images, each of shape 63x63x63, I used train_test_split from sklearn:

test_split = 0.1
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(triplets, df.label, test_size = test_split, random_state = random_state)

Every time my runtime disconnects, I need to run this again to proceed. However, this command alone takes close to 10 minutes (or possibly more) to run, while every other command in the notebook finishes in a few seconds or less. I'm not really sure what the issue is; I tried changing the runtime to GPU, and my internet connection seems quite stable. What could the problem be?

Why does it take so much time?

Your data shape is 11500x63x63x63. It is normal for this to take a long time, because the data is massive.

Explanation: Since the data shape is 11500x63x63x63, your data contains approximately 3x10^9 elements (the exact value is 2,875,540,500). A machine can generally execute 10^7~10^8 instructions per second. Since Python is relatively slow, assume Google Colab executes about 10^7 instructions per second; then,

Minimum time needed for train_test_split = 3x10^9 / 10^7 = 300 seconds = 5 minutes
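The arithmetic above can be verified with a quick back-of-the-envelope sketch:

```python
import numpy as np

# Element count of an array of shape (11500, 63, 63, 63)
n_elements = int(np.prod((11_500, 63, 63, 63), dtype=np.int64))
print(n_elements)        # 2875540500, i.e. roughly 3x10^9

# Rough time estimate at 10^7 operations per second
print(n_elements / 1e7)  # ~288 seconds, on the order of 5 minutes
```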

However, while the actual time complexity of the train_test_split function is close to O(n), the heavy data passing and retrieval it performs creates a bottleneck, which roughly doubles your script's running time.
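To see where that copying comes from, here is a minimal sketch (with small, illustrative shapes) showing that train_test_split returns fresh copies rather than views of the input array, which can be checked with np.shares_memory:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((100, 4))
y = np.zeros(100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# The splits are new allocations, not views into X, so for a dataset
# with 63x63x63 per sample every split copies gigabytes of data.
print(np.shares_memory(X, X_train))  # False
```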

How to solve it?

A simple solution is to pass the indices of the feature dataset instead of the feature dataset itself (in this case, the feature dataset is triplets). This cuts out the extra time spent copying the returned training and testing features inside train_test_split. Depending on the data type you are using, this can give a noticeable performance boost.

To illustrate, here is a short code snippet:

# Build an index array for the input features
X_index = np.arange(0, 11500)

# Pass the index array instead of the big feature matrix
X_train, X_test, y_train, y_test = train_test_split(X_index, df.label, test_size=0.1, random_state=42)

# Extract the feature matrices using the split index arrays
X_train = triplets[X_train]
X_test = triplets[X_test]

In the code above, I pass the indices of the input features and split them with train_test_split. I then extract the train and test feature matrices manually, which avoids the cost of returning a big matrix.

The estimated time improvement depends on the data type you are using. To support this, here is a benchmark using NumPy matrices of various data types, run on Google Colab. The benchmark code and output are given below. In some cases, however, the improvement is smaller than the benchmark suggests.
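The dtype dependence is largely a matter of memory footprint: the bytes moved per split scale with the element size, which can be estimated before running the benchmark below:

```python
import numpy as np

# Total size of the benchmark array (5000, 63, 63, 63) for two of the
# tested dtypes; larger footprints mean more data copied per split.
shape = (5000, 63, 63, 63)
n_elements = int(np.prod(shape, dtype=np.int64))  # 1,250,235,000 elements

size_int8_gb = n_elements * np.dtype(np.int8).itemsize / 1e9      # ~1.25 GB
size_float64_gb = n_elements * np.dtype(np.float64).itemsize / 1e9  # ~10 GB
print(size_int8_gb, size_float64_gb)
```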

Code:

import timeit
import numpy as np
from sklearn.model_selection import train_test_split

def benchmark(dtypes):
    for dtype in dtypes:
        print('Benchmark for dtype', dtype, end='\n'+'-'*40+'\n')
        X = np.ones((5000, 63, 63, 63), dtype=dtype)
        y = np.ones((5000, 1), dtype=dtype)
        X_index = np.arange(0, 5000)

        start_time = timeit.default_timer()
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
        print(f'Time elapsed: {timeit.default_timer()-start_time:.3f}')

        start_time = timeit.default_timer()
        X_train, X_test, y_train, y_test = train_test_split(X_index, y, test_size=0.1, random_state=42)

        X_train = X[X_train]
        X_test = X[X_test]
        print(f'Time elapsed with indexing: {timeit.default_timer()-start_time:.3f}')
        print()

benchmark([np.int8, np.int16, np.int32, np.int64, np.float16, np.float32, np.float64])

Output:

Benchmark for dtype <class 'numpy.int8'>
----------------------------------------
Time elapsed: 0.473
Time elapsed with indexing: 0.304

Benchmark for dtype <class 'numpy.int16'>
----------------------------------------
Time elapsed: 0.895
Time elapsed with indexing: 0.604

Benchmark for dtype <class 'numpy.int32'>
----------------------------------------
Time elapsed: 1.792
Time elapsed with indexing: 1.182

Benchmark for dtype <class 'numpy.int64'>
----------------------------------------
Time elapsed: 2.493
Time elapsed with indexing: 2.398

Benchmark for dtype <class 'numpy.float16'>
----------------------------------------
Time elapsed: 0.730
Time elapsed with indexing: 0.738

Benchmark for dtype <class 'numpy.float32'>
----------------------------------------
Time elapsed: 1.904
Time elapsed with indexing: 1.400
    
Benchmark for dtype <class 'numpy.float64'>
----------------------------------------
Time elapsed: 5.166
Time elapsed with indexing: 3.076
