Different results when using train_test_split vs. manually splitting the data

Question

I have a pandas DataFrame that I want to make predictions on, and I want the root mean squared error (RMSE) for each feature. I'm following an online guide that splits the dataset manually, but I thought it would be more convenient to use train_test_split from sklearn.model_selection. Unfortunately, I get different RMSE values when I split the data manually than when I use train_test_split.

A (hopefully) reproducible example:

import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['feature_1','feature_2','feature_3','feature_4'])
df['target'] = np.random.randint(2,size=100)
df2 = df.copy()

Here is a function, knn_train_test, that splits the data manually, fits the model, makes predictions, etc.:

def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(0)

    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)

    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]

    # Fit a KNN model using default k value.
    knn.fit(train_df[[train_col]], train_df[target_col])

    # Make predictions using model.
    predicted_labels = knn.predict(test_df[[train_col]])

    # Calculate and return RMSE.
    mse = mean_squared_error(test_df[target_col], predicted_labels)
    rmse = np.sqrt(mse)
    return rmse

rmse_results = {}
train_cols = df.columns.drop('target')

# For each column (minus `target`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'target', df)
    rmse_results[col] = rmse_val

# Create a Series object from the dictionary so 
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()

# Output
feature_3    0.541110
feature_2    0.548452
feature_4    0.559285
feature_1    0.569912
dtype: float64

Now, here is a function, knn_train_test2, that splits the data using train_test_split:

def knn_train_test2(train_col, target_col, df2):

    knn = KNeighborsRegressor()
    np.random.seed(0)

    X_train, X_test, y_train, y_test = train_test_split(df2[[train_col]],df2[[target_col]], test_size=0.5)

    knn.fit(X_train,y_train)

    predictions = knn.predict(X_test)

    mse = mean_squared_error(y_test,predictions)

    rmse = np.sqrt(mse)

    return rmse

rmse_results = {}
train_cols = df2.columns.drop('target')

for col in train_cols:
    rmse_val = knn_train_test2(col, 'target', df2)
    rmse_results[col] = rmse_val


rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()

# Output
feature_4    0.522303
feature_3    0.556417
feature_1    0.569210
feature_2    0.572713
dtype: float64

Why am I getting different results? I think I'm misunderstanding the split > train > test process in general, or maybe misunderstanding or mis-specifying train_test_split. Thank you in advance.

Answer 1

Splitting the data manually is just slicing (after your own shuffle), whereas train_test_split performs its own shuffle before it splits. Try fixing the random seed and see whether you get the same results each time you use train_test_split.
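One way to fix the seed is the random_state parameter of train_test_split. The sketch below rebuilds the toy data from the question and checks that two calls with the same seed return the same rows; the helper split_once is a hypothetical name used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as in the question.
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=['feature_1', 'feature_2', 'feature_3', 'feature_4'])
df['target'] = np.random.randint(2, size=100)


def split_once(frame, seed):
    # Hypothetical helper: random_state pins the shuffle inside
    # train_test_split, independently of the global NumPy seed.
    return train_test_split(frame[['feature_1']], frame['target'],
                            test_size=0.5, random_state=seed)


X_train_a, X_test_a, y_train_a, y_test_a = split_once(df, seed=0)
X_train_b, X_test_b, y_train_b, y_test_b = split_once(df, seed=0)

# The same seed should yield the same rows in the same order.
same_split = (X_train_a.index.equals(X_train_b.index)
              and X_test_a.index.equals(X_test_b.index))
print(same_split)
```

This only makes train_test_split agree with itself from run to run; it does not by itself make it agree with a hand-written split.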

Answer 2

This is the basic nature of machine learning. When you split the data manually, you get one version of the training and test sets. When you use the sklearn function, you get a different training and test set. Your model makes predictions based on the training data it receives, so the final results differ between the two.

If you want reproducible results, pass a seed value to train_test_split (its random_state parameter) when creating the training sets. The seed makes train_test_split return the same split each time. Then, when running your ML function, set a seed there too, since many ML models start training from random weights. Run your model on datasets split with the same seed and you will get the same results.
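As a rough illustration of "same training data, same results", the sketch below splits the row positions once with a fixed seed and reuses that split for every feature. The helper rmse_for and the variable names are assumptions made for this example, not part of the question's code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Same toy data as in the question.
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=['feature_1', 'feature_2', 'feature_3', 'feature_4'])
df['target'] = np.random.randint(2, size=100)


def rmse_for(col, train_df, test_df, target_col='target'):
    # Hypothetical helper: fit KNN on one feature and return the test RMSE.
    knn = KNeighborsRegressor()
    knn.fit(train_df[[col]], train_df[target_col])
    predictions = knn.predict(test_df[[col]])
    return np.sqrt(mean_squared_error(test_df[target_col], predictions))


# Split the row positions once, with a fixed seed, and reuse the result.
positions = np.arange(len(df))
train_pos, test_pos = train_test_split(positions, test_size=0.5, random_state=0)
train_df, test_df = df.iloc[train_pos], df.iloc[test_pos]

feature_cols = df.columns.drop('target')
first_run = pd.Series({col: rmse_for(col, train_df, test_df) for col in feature_cols})
second_run = pd.Series({col: rmse_for(col, train_df, test_df) for col in feature_cols})

print(first_run.sort_values())
print(first_run.equals(second_run))
```

Because both runs see identical training and test rows, any remaining difference would have to come from the model itself rather than from the split.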

Answer 3

Your custom split differs from scikit-learn's implementation of train_test_split, which is why you get different results for the same seed.

Here you can find the official implementation. The first notable thing is that the ShuffleSplit class behind train_test_split defaults to 10 iterations of re-shuffling and splitting (check the n_splits parameter), although train_test_split itself only takes the first of those splits.

Only if your approach does exactly what the scikit-learn approach does can you expect the same result for the same seed.
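To see how the two approaches relate on a given installation, one option is to compare which rows each one puts in the training set under the same global seed. The sketch below uses a hypothetical 100-row frame and prints the comparison rather than assuming an outcome, since the answer depends on how the installed scikit-learn version slices its permutation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame; only the row labels matter here.
df = pd.DataFrame({'x': np.arange(100)})
half = len(df) // 2

# Manual split, following the question: shuffle, then slice in half.
np.random.seed(0)
shuffled_index = np.random.permutation(df.index)
manual_train = set(shuffled_index[:half].tolist())
manual_test = set(shuffled_index[half:].tolist())

# train_test_split with the same global seed and no random_state.
np.random.seed(0)
sk_train, sk_test = train_test_split(df, test_size=0.5)
sk_train_rows = set(sk_train.index.tolist())
sk_test_rows = set(sk_test.index.tolist())

print('same training rows:', manual_train == sk_train_rows)
print('halves swapped:', manual_train == sk_test_rows)
print('rows shared by both training sets:', len(manual_train & sk_train_rows))
```

If the training sets differ, the RMSE values will generally differ too, even though both procedures were seeded identically.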

3 answers

Answer 1
1 2019-05-10 03:45:33

Answer 2
1 2019-05-10 03:49:15

Answer 3
1 accepted 2019-05-10 06:04:14
