How to make predictions on new data in a Pandas DataFrame with some extra columns?
I used a random forest classifier to build a model. The model works fine: I can output the score as well as the probability values on the train and test sets.
The challenge is:
I used 29 variables as features, with 1 target.
When I score X_test, it works fine.
How do I retain my ID and get predictions for the new file?
What I have tried so far:
import pandas as pd
data = pd.read_csv('learn2.csv')
y = data['Target']  # Labels
X=data[[
'xsixn', 'xssocixtesDegreeOnggy', 'xverxgeeeouseeeoggdIncome', 'BxceeeggorsDegreeOnggy', 'Bggxckorxfricxnxmericxn',
'Ceeiggdrenxteeome', 'Coggggege', 'Eggementxry', 'GrxduxteDegree', 'eeigeeSceeoogg', 'eeigeeSceeooggGrxduxte', 'eeouseeeoggdsEst',
'MedixneeouseeeoggdIncome', 'NoVeeeicgges', 'Oteeerxsixn', 'OteeersRxces', 'OwnerOccupiedPercent', 'PercentBggueCoggggxrWorkers',
'PercentWeeiteCoggggxr', 'PopuggxtionEst', 'PopuggxtionPereeouseeeoggd', 'RenterOccupiedPercent', 'RetiredOrDisxbggePersons',
'TotxggDxytimePopuggxtion', 'TotxggStudentPopuggxtion', 'Unempggoyed', 'VxcxnteeousingPercent', 'Weeite', 'WorkpggxceEstxbggiseements'
]]
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% training
# Import the random forest model
from sklearn.ensemble import RandomForestClassifier
# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)
# Train the model on the training set
clf.fit(X_train, y_train)
# Predict on the held-out test set
y_pred = clf.predict(X_test)
Predicting on the new file:
data1=pd.read_csv('score.csv')
y_pred2 = clf.predict(data1)
ValueError: Number of features of the model must match the input. Model n_features is 29 and input n_features is 30
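The error says the model was fitted on 29 columns but received 30. One quick way to see which column is the odd one out is to compare the new file's columns with the training feature list. The sketch below uses small made-up frames; the names `f1`, `f2`, `train_features` and `new_data` are placeholders, not the real data:

```python
import pandas as pd

# Hypothetical stand-ins for the training feature list and the new file.
train_features = ['f1', 'f2']
new_data = pd.DataFrame({'ID': [101, 102], 'f1': [0.1, 0.2], 'f2': [1.0, 2.0]})

# Columns present in the new file that the model was never trained on.
extra = [c for c in new_data.columns if c not in train_features]
# Features the model expects that the new file does not provide.
missing = [c for c in train_features if c not in new_data.columns]

print('extra:', extra)
print('missing:', missing)
```

If the only extra column is the identifier and nothing is missing, it should be enough to leave that column out when calling predict.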
You can exclude the 'ID' column while generating the predictions on the new dataset using the pandas difference function:
data1=pd.read_csv('score.csv')
For ease of further use, I am storing the predictions in a new DataFrame:
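# sort=False keeps the file's column order (difference() sorts by default), which must match the training order
y_pred2 = pd.DataFrame(clf.predict(data1[data1.columns.difference(['ID'], sort=False)]), columns=['Predicted'], index=data1.index)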
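One thing worth checking with Index.difference: by default it returns the remaining labels sorted, which may not be the order the model was trained on, while passing sort=False keeps the original order. A small sketch with made-up column names (`zeta`, `alpha`, `Beta` are placeholders) shows the difference:

```python
import pandas as pd

# Hypothetical column layout: ID first, then features in training order.
cols = pd.Index(['ID', 'zeta', 'alpha', 'Beta'])

# Default behaviour: the remaining labels come back sorted.
sorted_default = list(cols.difference(['ID']))
# sort=False: the remaining labels keep their original order.
file_order = list(cols.difference(['ID'], sort=False))

print(sorted_default)
print(file_order)
```

Because a fitted model expects the features in the same positions as during training, the second form is likely the safer one when the new file lists its columns in training order.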
To map the predictions against the 'ID', use pd.concat:
pred = pd.concat([data1['ID'], y_pred2['Predicted']], axis = 1)
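As an alternative sketch of the same idea, the feature list used for training can be reused to select columns from the new file, which keeps both the set and the order of features fixed. Everything here is illustrative: the data is synthetic, and `feature_cols`, `train`, `new` and the `Probability` column are assumed names rather than anything from the original files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_cols = ['f1', 'f2', 'f3']  # placeholder feature names

# Synthetic training data with a binary target.
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_cols)
train['Target'] = (train['f1'] + train['f2'] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train[feature_cols], train['Target'])

# Synthetic "new file": the same features plus an ID column.
new = pd.DataFrame(rng.normal(size=(5, 3)), columns=feature_cols)
new.insert(0, 'ID', [101, 102, 103, 104, 105])

# Select exactly the training features, in training order.
X_new = new[feature_cols]

# Keep the ID next to the predicted class and its probability.
result = pd.DataFrame({
    'ID': new['ID'],
    'Predicted': clf.predict(X_new),
    'Probability': clf.predict_proba(X_new).max(axis=1),
})
print(result)
```

Selecting by an explicit list also fails loudly with a KeyError if the new file lacks one of the expected features, which may be preferable to silently scoring the wrong columns.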