
Apply Machine learning process simultaneously on multiple datasets with R

I want to delete correlated variables and perform lasso regression on multiple datasets. So I divided my data into two lists: the first list contains the variables and the second contains the targets.

I also want to split my data into train and test sets before applying lasso, then make predictions and store the results in a final dataframe.

The main steps:

1- Correlation (delete correlated variables)

2- Divide the data into train and test sets

3- Perform LASSO

4- Make predictions

5- Store the predictions in a dataframe with their labels

Thanks!

set.seed(99)
 library("caret")
  # Create data frames
 H <- data.frame(replicate(10,sample(0:20,10,rep=TRUE)))   
 C <- data.frame(replicate(5,sample(0:100,10,rep=FALSE)))
 R <- data.frame(replicate(7,sample(0:30,10,rep=TRUE)))
 E <- data.frame(replicate(4,sample(0:40,10,rep=FALSE)))
  
 # Create target variables
 Y_H <- data.frame(replicate(1,sample(20:35, 10, rep = TRUE)))
 Y_H
 # Each target has a single column, so it can be renamed directly
 names(Y_H) <- "label_1"

 Y_C <- data.frame(replicate(1,sample(15:65, 10, rep = TRUE)))
 names(Y_C) <- "label_2"

 Y_R <- data.frame(replicate(1,sample(25:45, 10, rep = TRUE)))
 names(Y_R) <- "label_3"

 Y_E <- data.frame(replicate(1,sample(21:80, 10, rep = TRUE)))
 names(Y_E) <- "label_4"

 # Store observations and targets in lists
 inputs <- list(H, C, R, E)

 targets <- list(Y_H, Y_C, Y_R, Y_E)

# Perform correlation
 outputs <- list()


 for(df in inputs){
     data.cor <- cor(df)
     high.cor <- findCorrelation(data.cor, cutoff=0.40)
     outputs <- append(outputs, list(df[,-high.cor]))
 }

library("glmnet")

lasso_cv <- list()
lasso_model <- list()

for(i in outputs){
   for(j in targets){
      lasso_cv[i] <- cv.glmnet(as.matrix(outputs[[i]]), as.matrix(targets[[j]]), standardize = TRUE, type.measure="mse",  alpha = 1,nfolds = 3)

      lasso_model[i] <- glmnet(as.matrix(outputs[[i]]), as.matrix(targets[[j]]),lambda = lasso_cv[i]$lambda_cv, alpha = 1, standardize = TRUE)

   }
}

When I run my for loop, it gives this error:

Error in h(simpleError(msg, call)) : 
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': invalid subscript type 'list'

It seems to me that the error is in what the last for loops iterate over.

You wrote for(i in outputs), and then used as.matrix(outputs[[i]]). So, at the first iteration, i is the first data frame itself and you are basically calling as.matrix(outputs[[outputs[[1]]]]), which does not make sense. Similar reasoning applies to for(j in targets).
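A minimal sketch of that behaviour, using a small made-up list (outputs_demo is a hypothetical stand-in, not the data from the question): looping over the list yields its elements, and using an element as an index fails.

```r
# Toy list standing in for `outputs` (hypothetical data, not the question's).
outputs_demo <- list(data.frame(a = 1:3), data.frame(b = 4:6))

# Looping over the list itself yields the elements, not their positions.
for (i in outputs_demo) {
  print(class(i))  # i is a data frame on every iteration
}

# Using such an element as an index is what triggers the error.
first_element <- outputs_demo[[1]]
msg <- tryCatch(
  outputs_demo[[first_element]],
  error = function(e) conditionMessage(e)
)
print(msg)  # the captured error message about the subscript type
```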

Try replacing the code I quoted with for(i in seq_len(length(outputs))) and for(j in seq_len(length(targets))). That should work. This way, at the first iteration as.matrix(outputs[[i]]) translates to as.matrix(outputs[[1]]), and similarly for the targets, which seems to be what you were aiming for.
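As a rough illustration of index-based looping, here is a sketch on toy data (outputs_demo, targets_demo and visited are hypothetical names). It uses seq_along(x), which produces the same indices as seq_len(length(x)).

```r
# Hypothetical stand-ins for `outputs` and `targets`.
outputs_demo <- list(
  data.frame(x1 = 1:3, x2 = 4:6),
  data.frame(x1 = 7:9)
)
targets_demo <- list(
  data.frame(label_1 = c(10, 20, 30)),
  data.frame(label_2 = c(1, 2, 3))
)

visited <- character(0)
for (i in seq_along(outputs_demo)) {
  for (j in seq_along(targets_demo)) {
    # i and j are now integers, so [[ ]] extracts one data frame each.
    x <- as.matrix(outputs_demo[[i]])
    y <- as.matrix(targets_demo[[j]])
    visited <- c(
      visited,
      sprintf("outputs[[%d]] (%d x %d) with targets[[%d]]", i, nrow(x), ncol(x), j)
    )
  }
}
print(visited)
```

Note that the nested loops pair every dataset with every target. If each dataset is meant to go only with its own target (an assumption about the intent), a single loop that uses the same index for both lists would be enough.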

P.S. I am not sure about the rest of your code. If we check, lasso_cv[i]$lambda_cv returns NULL for every i, so the lambda argument of glmnet never receives a value. You may want to look into that.
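One possible reading of the NULL, offered as a sketch rather than a definitive fix: a cv.glmnet object stores its selected penalties as lambda.min and lambda.1se, not lambda_cv, and whole model objects need to be stored with [[i]] rather than [i]. The example below runs on simulated data (inputs_demo, targets_demo, make_inputs and the 70/30 split are assumptions, not from the question). It pairs each dataset with its own target and leaves out the correlation filter.

```r
library(glmnet)

set.seed(99)

# Hypothetical toy data: two datasets of 60 rows, one target per dataset.
make_inputs <- function(n, p) as.data.frame(matrix(rnorm(n * p), nrow = n))
inputs_demo  <- list(make_inputs(60, 6), make_inputs(60, 4))
targets_demo <- list(
  data.frame(label_1 = rnorm(60)),
  data.frame(label_2 = rnorm(60))
)

lasso_cv    <- vector("list", length(inputs_demo))
lasso_model <- vector("list", length(inputs_demo))
results     <- vector("list", length(inputs_demo))

for (i in seq_along(inputs_demo)) {
  x <- as.matrix(inputs_demo[[i]])
  y <- targets_demo[[i]][[1]]

  # Assumed 70/30 train/test split.
  train_idx <- sample(nrow(x), size = floor(0.7 * nrow(x)))
  x_train <- x[train_idx, , drop = FALSE]
  x_test  <- x[-train_idx, , drop = FALSE]

  # [[i]] stores the whole fitted object in slot i.
  lasso_cv[[i]] <- cv.glmnet(x_train, y[train_idx], alpha = 1, nfolds = 3,
                             type.measure = "mse", standardize = TRUE)

  # lambda.min is the penalty with the lowest cross-validated error.
  lasso_model[[i]] <- glmnet(x_train, y[train_idx], alpha = 1,
                             lambda = lasso_cv[[i]]$lambda.min,
                             standardize = TRUE)

  pred <- predict(lasso_model[[i]], newx = x_test)

  # Keep the target's label next to the observed and predicted values.
  results[[i]] <- data.frame(
    label     = names(targets_demo[[i]]),
    observed  = y[-train_idx],
    predicted = as.numeric(pred)
  )
}

final <- do.call(rbind, results)
head(final)
```

Using lambda.1se instead of lambda.min would give a more strongly regularised model. Which one suits the data is a judgment call.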
