
What is the difference between model.fit(X, y) and model.fit(train_X, train_y)?

While studying the Kaggle micro-course on machine learning, I learned how to find the optimum leaf size (by finding the minimum MAE). However, I got a different MAE value when I put the optimum leaf size into the final model. For example,

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit on the training split only, then score on the held-out validation split
    model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(preds_val, val_y)
    return mae

candidates_leaf_nodes = list(range(5, 500))
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidates_leaf_nodes}
best_leaf_size = min(scores, key = scores.get)
best_model = DecisionTreeRegressor(max_leaf_nodes = best_leaf_size, random_state = 0)
best_model.fit(X, y)  # fitted on the full X, y rather than on train_X, train_y
best_preds = best_model.predict(val_X)
best_mae = mean_absolute_error(best_preds, val_y)

print("best_leaf_size: {:,.0f}".format(best_leaf_size))
# MAE from the model fitted on train_X, train_y
print("Validation MAE for best value of best_leaf_size: {:,.0f}".format(get_mae(best_leaf_size, train_X, val_X, train_y, val_y)))
# MAE from the model fitted on X, y
print("Validation MAE for best value of best_leaf_size: {:,.0f}".format(best_mae))

The result showed

best_leaf_size: 71

Validation MAE for best value of best_leaf_size: 26,704

Validation MAE for best value of best_leaf_size: 18,616

I got an MAE of 26,704 when I used .fit(train_X, train_y) and an MAE of 18,616 when I used .fit(X, y).

So I wonder why I got two different values; in other words, what is the difference between .fit(train_X, train_y) and .fit(X, y)?

Thank you.

model.fit(X, y) means we are using all of our data to train the model, and then the same data is used to evaluate it, i.e. the training and test datasets are the same, which will not give realistic results. So the best idea is to divide the dataset into two parts, training data and test data. Here, both the features (X) and the target values (y) are divided.

X is divided into train_X and test_X, and y is divided into train_y and test_y.

The split is based on a random number generator. Supplying a numeric value to the random_state argument guarantees we get the same split every time we run the script.

from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)
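As a quick sanity check (just a sketch, assuming X is a pandas DataFrame and y a pandas Series, as in the course notebooks), running the split a second time with the same random_state returns identical partitions, and by default train_test_split holds out 25% of the rows for testing:

repeat_train_X, repeat_test_X, repeat_train_y, repeat_test_y = train_test_split(X, y, random_state = 0)
print(train_X.equals(repeat_train_X))  # True: the same rows land in the training set on every run
print(len(train_X), len(test_X))       # a 75% / 25% split by default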

You are fitting the model with the same parameters but on two different datasets: one on train_X and the other on X. Since val_X was produced by splitting X, the model fitted on the full X has already seen the validation rows during training, so its MAE of 18,616 is optimistically low; the 26,704 you get from fitting on train_X is the realistic estimate of how the model performs on unseen data.
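To make the difference concrete, here is a minimal, self-contained sketch. It uses a synthetic dataset from make_regression as a stand-in for the course's housing data (so the numbers will not match yours), fits the same DecisionTreeRegressor once on train_X and once on the full X, and scores both on val_X. Because every row of val_X was part of X during fitting, the second model's MAE typically comes out much lower, which is exactly the pattern in your output:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the course data, just for illustration
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Honest evaluation: the model never sees the validation rows during fitting
model_train = DecisionTreeRegressor(max_leaf_nodes=71, random_state=0)
model_train.fit(train_X, train_y)

# Leaky evaluation: X already contains every row of val_X
model_full = DecisionTreeRegressor(max_leaf_nodes=71, random_state=0)
model_full.fit(X, y)

print(mean_absolute_error(val_y, model_train.predict(val_X)))  # realistic validation MAE
print(mean_absolute_error(val_y, model_full.predict(val_X)))   # optimistically low MAE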
