Random forest: variable lengths differ

I am trying to run randomForest (RF) with one of my features as the response variable, and I am having trouble passing the response's name as a string held in a variable. When I pass the string through a variable as the response, I get a "variable lengths differ" error. When I instead type the feature name directly as the response, it works fine. Can you shed some light on why the variable lengths differ? Thanks.

> colnames(Data[1])
[1] "feature1"
> rf.file = randomForest(formula =colnames(Data[1])~ ., data = Data, proximity = T,      importance = T, ntree = 500, nodesize = 3)
Error in model.frame.default(formula = colnames(Data[1]) ~ .,  : 
  variable lengths differ (found for 'feature1')

Enter a frame number, or 0 to exit   

1: randomForest(formula = colnames(Data[1]) ~ ., data = Data, proximity = T, importance = T, ntree = 500, nodesize = 3)
2: randomForest.formula(formula = colnames(Data[1]) ~ ., data = brainDataTrim, proximity = T, importance = T, ntree = 500, nodesize = 3)
3: eval(m, parent.frame())
4: eval(expr, envir, enclos)
5: model.frame(formula = colnames(Data[1]) ~ ., data = Data, na.action = function (object, ...) 
6: model.frame.default(formula = colnames(Data[1]) ~ ., data = Data, na.action = function (object, ...) 

Selection: 0



> rf.file = randomForest(formula =feature1~ ., data = Data, proximity = T,      importance = T, ntree = 500, nodesize = 3)
> rf.file

Call:
 randomForest(formula = feature1 ~ ., data = Data,      proximity = T, importance = T, ntree = 500, nodesize = 3) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 0.1536834
                    % Var explained: 34.21
> 

You are simply misunderstanding how formulas work. Basically, your first attempt isn't supposed to work.

Formulas should consist of variable names, possibly wrapped in simple functions, e.g.:

var1 ~ var2
var1 ~ log(var2)

Note the lack of quotes. If you didn't quote it, it's not a string, it's a symbol.

So avoid raw strings and unusual evaluation demands (like Data[1], or any use of $) in your formulas. To construct a formula from strings, paste the pieces together and then call as.formula on the resulting string.
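The paste-then-as.formula approach could look roughly like the sketch below. The data frame here is made up to stand in for the asker's Data, and the names response, f and f2 are only illustrative. reformulate() is a base R (stats) helper that should give the same result without manual pasting. The fit is skipped if the randomForest package is not installed.

```r
# Hypothetical data standing in for the asker's Data (an assumption, not from the post)
set.seed(42)
Data <- data.frame(feature1 = rnorm(30), feature2 = rnorm(30), feature3 = rnorm(30))

# The response name arrives as a string
response <- colnames(Data)[1]

# Paste the formula together as text, then convert it to a formula object
f <- as.formula(paste(response, "~ ."))

# Alternative: reformulate() builds the formula from its parts
f2 <- reformulate(".", response = response)

print(f)
print(f2)

# Fit only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf.file <- randomForest::randomForest(f, data = Data, proximity = TRUE,
                                        importance = TRUE, ntree = 500, nodesize = 3)
  print(rf.file)
}
```

Because f is an ordinary formula object by the time it reaches randomForest, the call should behave the same as typing feature1 ~ . by hand.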

Keep in mind that the whole point of a formula is that you have provided a symbolic representation of the model, and R will then go look for the specific columns you named in the data frame provided.
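As for why the message mentions lengths: in the failing call the left-hand side is an expression, so it is evaluated rather than looked up as a column, and colnames(Data[1]) evaluates to a character vector of length 1, while every column pulled in by . has one value per row. A small sketch with base R's model.frame (the function named in the traceback) and a made-up four-row data frame seems to reproduce this. The names resp, res and mf are illustrative only.

```r
# Hypothetical four-row data frame (an assumption, not the asker's data)
Data <- data.frame(feature1 = c(0.1, 0.4, 0.3, 0.9),
                   feature2 = c(1, 2, 3, 4))

# What the left-hand side of the failing formula evaluates to
resp <- colnames(Data[1])
print(length(resp))   # a single string
print(nrow(Data))     # one value per row for every real column

# Capture the error message instead of stopping the script
res <- tryCatch(
  model.frame(colnames(Data[1]) ~ ., data = Data),
  error = function(e) conditionMessage(e)
)
print(res)

# With a bare symbol, the column is looked up in Data and the lengths agree
mf <- model.frame(feature1 ~ ., data = Data)
print(dim(mf))
```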

I think some functions will coerce a string representation of a formula for you (e.g. "var1 ~ var2"), but I wouldn't count on it or expect it.
