How to predict on new data in a Pandas DataFrame that has some extra columns?

Question

I built a model with a random forest classifier. The model works fine: I can output the score as well as the probability values on the train and test sets.

The challenge is:

I used 29 variables as features, with 1 target.
When I score X_test, it works fine.
When I bring in a new data set that has the 29 variables plus my unique ID (primary key), the model errors out, saying it expects 29 variables.

How do I retain my ID and get predictions for the new file?

What I have tried so far:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.read_csv('learn2.csv')
y = data['Target']  # Labels

# The 29 feature columns used for training
X = data[[
    'xsixn', 'xssocixtesDegreeOnggy', 'xverxgeeeouseeeoggdIncome', 'BxceeeggorsDegreeOnggy', 'Bggxckorxfricxnxmericxn',
    'Ceeiggdrenxteeome', 'Coggggege', 'Eggementxry', 'GrxduxteDegree', 'eeigeeSceeoogg', 'eeigeeSceeooggGrxduxte', 'eeouseeeoggdsEst',
    'MedixneeouseeeoggdIncome', 'NoVeeeicgges', 'Oteeerxsixn', 'OteeersRxces', 'OwnerOccupiedPercent', 'PercentBggueCoggggxrWorkers',
    'PercentWeeiteCoggggxr', 'PopuggxtionEst', 'PopuggxtionPereeouseeeoggd', 'RenterOccupiedPercent', 'RetiredOrDisxbggePersons',
    'TotxggDxytimePopuggxtion', 'TotxggStudentPopuggxtion', 'Unempggoyed', 'VxcxnteeousingPercent', 'Weeite', 'WorkpggxceEstxbggiseements'
]]

# Split dataset into training set and test set (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100)

# Train the model on the training set, then predict on the test set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Predicting on the new file:

data1 = pd.read_csv('score.csv')
y_pred2 = clf.predict(data1)  # data1 still contains the ID column

ValueError: Number of features of the model must match the input. Model n_features is 29 and input n_features is 30
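
The message points to one surplus column in the scoring file (30 instead of 29). A quick way to confirm which column is extra, and whether any training feature is missing, might look like the sketch below. The names `feature_cols` and `new_data` and the tiny three-feature frame are hypothetical stand-ins for the real 29 features and `score.csv`.

```python
import pandas as pd

# Hypothetical stand-ins: three feature names instead of the real 29,
# and a small in-memory frame instead of score.csv.
feature_cols = ["f1", "f2", "f3"]
new_data = pd.DataFrame(
    {"ID": [101, 102], "f1": [1, 2], "f2": [3, 4], "f3": [5, 6]}
)

# Columns present in the new file that the model was not trained on
extra = [c for c in new_data.columns if c not in feature_cols]
# Training features that the new file lacks
missing = [c for c in feature_cols if c not in new_data.columns]

print("extra columns:", extra)
print("missing columns:", missing)
```

If `extra` holds only the ID column and `missing` is empty, the fix is simply to leave the ID out of the frame passed to `predict`.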

Answer 1

You can exclude the 'ID' column when generating predictions on the new dataset by using the pandas Index.difference function:

data1=pd.read_csv('score.csv')

For ease of further use, I am storing the predictions in a new DataFrame:

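# sort=False keeps the columns in their original order, which should match the order used in training
features = data1.columns.difference(['ID'], sort=False)
y_pred2 = pd.DataFrame(clf.predict(data1[features]), columns=['Predicted'], index=data1.index)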
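
One thing worth checking with `columns.difference`: by default it returns the labels sorted, which may not be the order the model saw during training. The sketch below illustrates this on a made-up frame (`train_cols` and the column names are hypothetical) and shows an alternative that selects columns by the explicit training list, so the order is right regardless of how the new file is laid out.

```python
import pandas as pd

# Hypothetical training order, deliberately not alphabetical
train_cols = ["zeta", "alpha", "mid"]

new_data = pd.DataFrame(
    {"ID": [1, 2], "zeta": [0.1, 0.2], "alpha": [1.0, 2.0], "mid": [5, 6]}
)

# Default behaviour: the resulting labels come back sorted
sorted_cols = list(new_data.columns.difference(["ID"]))

# sort=False keeps the labels in the order they appear in the frame
kept_cols = list(new_data.columns.difference(["ID"], sort=False))

# Selecting by the training list fixes the order explicitly
X_new = new_data[train_cols]

print(sorted_cols)
print(kept_cols)
print(list(X_new.columns))
```

Indexing with the training column list also raises a `KeyError` if a feature is absent from the new file, which surfaces schema problems early.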

To map the predictions against the 'ID', use pd.concat:

pred = pd.concat([data1['ID'], y_pred2['Predicted']], axis = 1)
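
Since the question also mentions probability values, here is a minimal end-to-end sketch that keeps the ID next to both the predicted class and a probability. It runs on synthetic data: the feature names, the `Prob_1` column, and the `A0`..`A4` IDs are made up for illustration and are not the asker's real file.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_cols = ["f1", "f2", "f3"]  # hypothetical stand-ins for the 29 features

# Synthetic training data with a binary target
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_cols)
train["Target"] = (train["f1"] + train["f2"] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(train[feature_cols], train["Target"])

# Synthetic scoring data: the same features plus an ID column
new_data = pd.DataFrame(rng.normal(size=(5, 3)), columns=feature_cols)
new_data.insert(0, "ID", [f"A{i}" for i in range(5)])

# Pass only the feature columns to the model, in training order
X_new = new_data[feature_cols]

# Start from the ID and attach the model outputs row by row
result = new_data[["ID"]].copy()
result["Predicted"] = clf.predict(X_new)

# predict_proba columns follow clf.classes_, so look up the class-1 position
pos_idx = list(clf.classes_).index(1)
result["Prob_1"] = clf.predict_proba(X_new)[:, pos_idx]

print(result)
```

Because `result` is built from the same frame that supplied `X_new`, each prediction stays aligned with its ID without a separate join.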

Solution 1
1 accepted 2020-04-08 08:42:02