Ensemble in R using SVM

I'm trying to classify some data using SVM in R.

The data set:

D1 | D2 | D3 | word1 | word2 |...
1  | 2  | 3  | 0     | 1     |
3  | 2  | 1  | 1     | 0     |

D1, D2, D3 take values from 0 to 9 and each word takes a 0/1 value.

First I want to build a classifier that predicts D1 from word1, word2, etc. Then I want to build a classifier that predicts D2 from the D1 prediction and the words. D1, D2 and D3 used to be a single three-digit number, and each digit is related to the one before it.

So far I have:

trainD1 <- train[,-1]
trainD1$D2 <- NULL
trainD1$D3 <- NULL

modelD1 <- svm( train$D1~., trainD1, type="C-classification")

But I'm completely lost, any help is welcome.

Thanks

I'm sure you already know this, but to cover my bases: if D1 and D2 are predictive of D3, then using the actual values of D1 and D2 will generally be better than using predictions of them.

I will assume for the purposes of this question that D1 and D2 may not be present in your prediction data set, so that's why you have to predict them. It may still be more accurate to directly predict D3 from the "word" variables, but that's outside of the scope of this question.

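library(e1071)

train <- read.csv("trainingSmallExtra.csv")

# Stage 1: predict D1 from an arbitrary subset of the word columns
d1 <- svm(  x = train[,5:100],
            y = train$D1,
            gamma = 0.1)

# predict() without new data returns the fitted values on the training set
d1.predict <- predict(d1)
train      <- cbind(d1.predict, train)  # original columns shift right by one
x_names    <- c("d1.predict", colnames(train)[6:101])  # same words as stage 1

# Stage 2: predict D2 from the D1 prediction + the same subset of words
d2 <- svm(  x = train[,x_names],
            y = train$D2,
            gamma = 0.1)

d2.predict <- predict(d2)
train      <- cbind(d2.predict, train)  # columns shift right by one again

# Stage 3: predict D3 from both predictions + another arbitrary subset of words
x_names <- c("d1.predict", "d2.predict", colnames(train)[25:150])

final <- svm(  x = train[,x_names],
               y = train$D3,
               gamma = 0.1)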

summary(final)

Call:
svm.default(x = train[, x_names], y = train$D3, gamma = 0.1)

Parameters:
   SVM-Type:  eps-regression
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1
    epsilon:  0.1

Number of Support Vectors:  932

This is just to show you the process. In your code you will want to use more of the words and set whatever options you think are most appropriate.
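One option worth checking: the summary reports eps-regression because the digit columns are numeric, and svm() picks its default type from the class of y. If you want classification, as in the question, passing the target as a factor should switch the default to C-classification. The sketch below illustrates this on made-up data; the matrix x and the vector digit are hypothetical stand-ins, not the real training set.

```r
library(e1071)

set.seed(1)

# Hypothetical stand-in data: 200 rows of 10 binary "word" columns
x <- matrix(rbinom(200 * 10, 1, 0.5), nrow = 200,
            dimnames = list(NULL, paste0("word", 1:10)))
digit <- (x[, "word1"] + 2 * x[, "word2"]) %% 10  # made-up digit target

# Numeric target: svm() defaults to regression
reg <- svm(x = x, y = digit, gamma = 0.1)

# Factor target: svm() defaults to classification
cls <- svm(x = x, y = factor(digit), gamma = 0.1)

reg_pred <- predict(reg, x)  # numeric values, not restricted to whole digits
cls_pred <- predict(cls, x)  # factor whose levels are the observed digits
```

If you chain classifiers this way, the earlier predictions become factors, so they would need to be converted to numeric or dummy columns before being passed as part of x to the next stage.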

I recommend benchmarking performance with a holdout sample or cross-validation. Compare the ensemble with a single model that predicts D3 directly from the words, using the same benchmark for both.
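A minimal sketch of that comparison with a holdout sample, assuming regression as in the code above. Everything here is hypothetical: the data is synthetic, and the 70/30 split, the rmse helper and the variable names are only for illustration.

```r
library(e1071)

set.seed(1)

# Synthetic stand-in for the real data: 20 binary word columns and three
# digits in which each digit depends on the previous one
n     <- 300
words <- matrix(rbinom(n * 20, 1, 0.5), nrow = n,
                dimnames = list(NULL, paste0("word", 1:20)))
D1 <- (words[, "word1"] + 2 * words[, "word2"] + 3 * words[, "word3"]) %% 10
D2 <- (D1 + words[, "word4"]) %% 10
D3 <- (D2 + words[, "word5"]) %% 10

# Holdout split: fit on 70% of the rows, evaluate on the other 30%
idx     <- sample(n, size = 0.7 * n)
x_train <- words[idx, ]
x_test  <- words[-idx, ]

# Chained model: each stage adds the previous stage's prediction as a column.
# On the holdout rows the predictions come from the words only, never from
# the true D1 or D2.
m1       <- svm(x = x_train, y = D1[idx], gamma = 0.1)
s1_train <- cbind(d1.predict = predict(m1, x_train), x_train)
s1_test  <- cbind(d1.predict = predict(m1, x_test),  x_test)

m2       <- svm(x = s1_train, y = D2[idx], gamma = 0.1)
s2_train <- cbind(d2.predict = predict(m2, s1_train), s1_train)
s2_test  <- cbind(d2.predict = predict(m2, s1_test),  s1_test)

m3         <- svm(x = s2_train, y = D3[idx], gamma = 0.1)
chain_pred <- predict(m3, s2_test)

# Single model: predict D3 directly from the words
direct      <- svm(x = x_train, y = D3[idx], gamma = 0.1)
direct_pred <- predict(direct, x_test)

# Same benchmark for both: root mean squared error on the holdout rows
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_chain  <- rmse(D3[-idx], chain_pred)
rmse_direct <- rmse(D3[-idx], direct_pred)
```

Whichever model has the lower holdout error on your real data is the one to prefer; the synthetic data here says nothing about which that will be.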
