I'm trying to combine multiple lm
outputs into a data frame, for further calculations. I have a dataset of 1000 observations and 62 variables. The project is to randomly split the dataset 63/37, train the model, repeat this 1000 times and save the coefficients, the fitted values, and the r2 for all 1000 runs. So I'm doing most of that here (using mtcars
):
data("mtcars")
f <- function () {
fit <- lm(mpg ~ ., data = mtcars, subset = sample <- sample.int(n = nrow(mtcars), size = floor(.63*nrow(mtcars)), replace = F))
coef(fit)
}
output <- t(replicate(1000, f()))
I know I can get the rsq values with summary(fit)$r.squared
and I can use predict()
to get the fitted values. I'm just struggling with how to get them into the data frame with the saved coefficients.
The below should do
get_model <- function (input_data) {
fit <- lm(mpg ~ .,
data = mtcars,
subset = sample <- sample.int(n = nrow(mtcars),
size = floor(.63*nrow(mtcars)), replace = F)
)
return(fit)
}
get_results <- function(lm_model){
data <- data.frame()
data <- rbind(data, coef(lm_model))
data <- cbind(data, summary(lm_model)$r.squared)
colnames(data) <- c(names(mtcars), "rsquared")
return(data)
}
# running the above
input_data <- mtcars
general_df <- data.frame()
for(i in 1:1000){
my_model <- get_model(input_data)
final_data <- get_results(my_model)
general_df <- rbind(general_df, final_data)
}
You are very close:
library(tidyverse)
library(modelr)
data("mtcars")
get_data_lm <- function(data_df, testPCT = 0.37){
data_resample <- modelr::crossv_mc(data_df, n = 1, test = testPCT)
fit <- lm(mpg ~ ., data = as.data.frame(data_resample$train))
stats <- c(coef(fit),
"R2" = summary(fit)$r.squared,
"AdjR2" = summary(fit)$adj.r.squared)
pred_vals <- predict(fit, newdata = as.data.frame(data_resample$test))
c(stats, pred_vals)
}
output <- t(replicate(1000, get_data_lm(mtcars)))
The only thing you needed to do is concatenate the other statistics and predicted values you want. Alternatively, you could use a parallel sapply()
variant to make your simulation considerably faster.
Another comment: I use the crossv_mc()
function from the modelr::
package to create one testing and training partition. However, I could have used n = 1000
outside the function instead; this would have created a resample data frame in my working environment for me to apply()
a function over. See the modelr::
GitHub page for more info.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.