How do I use Decision Tree Regressor on new data? (Python, Pandas, Sklearn)

I started learning Python and machine learning very recently. I have been working through a basic Decision Tree Regressor example involving house prices. I have trained the model and found the best number of leaf nodes, but how do I use it on new data?

I have the columns below, and my target value is 'SalePrice':

['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

For the original data I already have the SalePrice, so I can compare the predictions against it. How would I go about finding the price if I only have the columns above?

Full code below:

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = 'train.csv'

home_data = pd.read_csv(iowa_file_path)
# Keep only the columns of interest
SimpleTable=home_data[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd','SalePrice']]
# Create the target object y (the value to predict)
y = home_data.SalePrice
# Create X from the feature columns to be analysed
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0, test_size=0.8, train_size=0.2)


# Specify Model
iowa_model = DecisionTreeRegressor(random_state=0)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)

val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Compare candidate values of max_leaf_nodes to find the best tree size
candidate_max_leaf_nodes = [10, 20, 50, 100, 200, 400]
for max_leaf_nodes in candidate_max_leaf_nodes:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("MAX leaf nodes: %d \t\t Mean Absolute Error:%d" % (max_leaf_nodes, my_mae))




scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}

best_tree_size = min(scores, key=scores.get)
print(best_tree_size)


# Refit on all the data and put the predictions back into a DataFrame
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)

final_predictions = final_model.predict(X)
finaltableinput = {'Predicted_Price':final_predictions}
finaltable = pd.DataFrame(finaltableinput)
SimpleTable.head()

jointable = SimpleTable.join(finaltable)

#export data with predicted values to csv
jointable.to_csv('newdata4.csv')




Thanks in advance.

If you want to know the price (y) given the independent variables (X) and an already trained model, you need to use the predict() method. The model that was built during training will then use those variables to predict the SalePrice. I see you've already used .predict() in your code.

You should start by defining the new input variable, for example:

# df_new is assumed to be a pandas DataFrame holding the new houses
X_new = df_new[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]
new_sale_price = final_model.predict(X_new)  # returns a NumPy array
df_new['SalePrice'] = new_sale_price  # one prediction per row of df_new, so the lengths match

You can do this in one line as well:

df_new['SalePrice'] = final_model.predict(X_new) 
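
For a version you can run end to end, here is a minimal sketch of the same idea. The numbers are made up, the feature list is shortened to three columns, and max_leaf_nodes=4 is an arbitrary choice for illustration; in your case the training table would come from train.csv and df_new from wherever the new houses are stored.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Shortened feature list, for illustration only
features = ["LotArea", "YearBuilt", "FullBath"]

# Made-up labelled data standing in for train.csv
train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "FullBath": [2, 2, 2, 1, 2, 1],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# Train on rows where the price is known
final_model = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0)
final_model.fit(train[features], train["SalePrice"])

# Made-up new houses: same feature columns, no SalePrice
df_new = pd.DataFrame({
    "LotArea": [9000, 12000],
    "YearBuilt": [1980, 2005],
    "FullBath": [1, 2],
})

# Select the columns in the same order as in training, then predict
df_new["Predicted_SalePrice"] = final_model.predict(df_new[features])
print(df_new)
```

The important part is that the new DataFrame has the same feature columns, in the same order, as the data the model was fitted on.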

Of course, since you don't know the real SalePrice for those values of X, you can't do a performance check on them. This is what happens in the real world: whenever you want to predict prices from a group of variables, you first train and tune your model on data where the answer is known, and then use it to make the predictions. Feel free to leave any questions in the comments if you have doubts.
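
Because the new rows have no true price, the only error estimate you get comes from the labelled data. One possible way to obtain it, sketched here on synthetic data, is scikit-learn's cross_val_score with the "neg_mean_absolute_error" scorer. The data, the five folds, and max_leaf_nodes=20 are assumptions for illustration, not values from your problem.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic labelled data: 200 rows, 3 features, a noisy linear "price"
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 50_000 + 20_000 * X[:, 0] + rng.normal(0, 5_000, size=200)

model = DecisionTreeRegressor(max_leaf_nodes=20, random_state=0)

# The scorer returns negative MAE (higher is better), so flip the sign
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -scores.mean()
print(f"Cross-validated MAE: {cv_mae:,.0f}")
```

That number is a rough guide to how far off the predictions on unseen houses might be, assuming the new houses resemble the ones in the training data.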

The decision tree algorithm is a supervised learning model, which means that in order to train it you must supply the model with the feature data as well as the target data ('SalePrice' in your case).
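
As a small illustration of that split, the sketch below uses a toy one-feature dataset (made-up numbers): fit() receives both the features and the target, while predict() receives features only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy labelled data: one feature, one target value per row
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)          # training needs features AND target

X_new = np.array([[2.4], [3.9]])     # new rows: features only
pred_new = model.predict(X_new)      # prediction needs features only
pred_train = model.predict(X_train)  # an unrestricted tree reproduces its training targets here

print(pred_new)
```

Each prediction is the average target of the training rows that ended up in the same leaf, so a tree never predicts a value outside the range of targets it was trained on.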

If you want to apply machine learning in a case where you don't have any target data, you have to use an unsupervised model.

A very basic introduction to these different kinds of learning models can be found here.
