I have created a decision tree model in R. The target variable is Salary, where we are trying to predict if the salary of a person is above or below 50k based on the other input variables
df<-salary.data
train = sample(1:nrow(df), nrow(df)/2)
train = sample(1:nrow(df), size=0.2*nrow(df))
test = - train
training_data = df[train, ]
testing_data = df[test, ]
fit <- rpart(training_data$INCOME ~ ., method="class", data=training_data)##generate tree
testing_data$predictionsOutput = predict(fit, newdata=testing_data, type="class")##make prediction
After that I tried to create a Gain chart by doing the following
# Gain Chart
pred <- prediction(testing_data$predictionsOutput, testing_data$INCOME)
gain <- performance(pred,"tpr","fpr")
plot(gain, col="orange", lwd=2)
By looking at the reference I am unable to understand how to use the ROCR package to build the chart by using the 'Prediction' function. Is this only for binary target variables? I get the error saying 'format of predictions is invalid'
Any help with this would be much appreciated to help me build a Gain chart for the above model. Thanks!!
AGE EMPLOYER DEGREE MSTATUS JOBTYPE SEX C.GAIN C.LOSS HOURS
1 39 State-gov Bachelors Never-married Adm-clerical Male 2174 0 40
2 50 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Male 0 0 13
3 38 Private HS-grad Divorced Handlers-cleaners Male 0 0 40
COUNTRY INCOME
1 United-States <=50K
2 United-States <=50K
3 United-States <=50K
Convert the prediction to a vector, using c()
library('rpart')
library('ROCR')
setwd('C:\\Users\\John\\Google Drive\\working\\R\\questions')
df<-read.csv(file='salary-class.csv',header=TRUE)
train = sample(1:nrow(df), nrow(df)/2)
train = sample(1:nrow(df), size=0.2*nrow(df))
test = - train
training_data = df[train, ]
testing_data = df[test, ]
fit <- rpart(training_data$INCOME ~ ., method="class", data=training_data)##generate tree
testing_data$predictionsOutput = predict(fit,
newdata=testing_data, type="class")##make prediction
# Doesn't work
# pred <- prediction(testing_data$predictionsOutput, testing_data$INCOME)
v <- c(pred = testing_data$predictionsOutput)
pred <- prediction(v, testing_data$INCOME)
gain <- performance(pred,"tpr","fpr")
plot(gain, col="orange", lwd=2)
This should work if you change
predict(fit, newdata=testing_data, type="class")
to
predict(fit, newdata=testing_data, type="prob")
The gains chart wants to rank-order by model probability.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.