简体   繁体   中英

Different results when using train_test_split vs manually splitting the data

I have a pandas dataframe that I want to make predictions on and get the root mean squared error for each feature. I'm following an online guide that splits the dataset manually, but I thought it would be more convenient to use train_test_split from sklearn.model_selection . Unfortunately, I'm getting different results when looking at the rmse values after splitting the data manually vs using train_test_split .

A (hopefully) reproducible example:

import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['feature_1','feature_2','feature_3','feature_4'])
df['target'] = np.random.randint(2,size=100)
df2 = df.copy()

Here is a function, knn_train_test , that splits the data manually, fits the model, makes predictions, etc:

def knn_train_test(train_col, target_col, df):
    knn = KNeighborsRegressor()
    np.random.seed(0)

    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)

    # Divide number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)

    # Select the first half and set as training set.
    # Select the second half and set as test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]

    # Fit a KNN model using default k value.
    knn.fit(train_df[[train_col]], train_df[target_col])

    # Make predictions using model.
    predicted_labels = knn.predict(test_df[[train_col]])

    # Calculate and return RMSE.
    mse = mean_squared_error(test_df[target_col], predicted_labels)
    rmse = np.sqrt(mse)
    return rmse

rmse_results = {}
train_cols = df.columns.drop('target')

# For each column (minus `target`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'target', df)
    rmse_results[col] = rmse_val

# Create a Series object from the dictionary so 
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()

#Output
feature_3    0.541110
feature_2    0.548452
feature_4    0.559285
feature_1    0.569912
dtype: float64

Now, here is a function, knn_train_test2, that splits the data using train_test_split :

def knn_train_test2(train_col, target_col, df2):

    knn = KNeighborsRegressor()
    np.random.seed(0)

    X_train, X_test, y_train, y_test = train_test_split(df2[[train_col]],df2[[target_col]], test_size=0.5)

    knn.fit(X_train,y_train)

    predictions = knn.predict(X_test)

    mse = mean_squared_error(y_test,predictions)

    rmse = np.sqrt(mse)

    return rmse

rmse_results = {}
train_cols = df2.columns.drop('target')

for col in train_cols:
    rmse_val = knn_train_test2(col, 'target', df2)
    rmse_results[col] = rmse_val


rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()

# Output
feature_4    0.522303
feature_3    0.556417
feature_1    0.569210
feature_2    0.572713
dtype: float64

Why am I getting different results? I think I'm misunderstanding the split > train > test process in general, or maybe misunderstanding/mis-specifying train_test_split . Thank you in advance

Splitting data manually is just slicing but train_test_split will also randomize the sliced data. Try fix the random number seed and see if you can get same results each time when using train_test_split .

This is basic machine learning nature. When you manually split the data, you have a different version of training and testing set. When you use the sklearn function, you get different training and testing set. Your model will make prediction based on what training data it recieves and thus your final results are different for both.

If you want to reproduce result, then use the train_test_split to create multiple training set by setting a seed value. A seed value is used to reproduce the same result in the train_test_split function. Then when running your ml function, set a seed in there too as even ML functions start training with random weights. Try your model on these datasets with same seed and you will get the results.

Your custom train_test_split implementation differs from scikit-learn's implementation, that's why you get different results for the same seed.

Here you can find the official implementation. The first thing which is notable is, that scikit-learn is doing by default 10 iterations of re-shuffeling & splitting. (check the n_splits parameter)

Only if your approach is doing exactly the same as the scitkit-learn approach, then you can expect to have the same result for the same seed.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM