I have a dataframe dfA (the real one has 1000 rows and 400,000 columns). From column 6 on, the variable names are "triads" formed by x with + different prefixes (GT_x, N_x, E_x), where x = rs1, rs7, rs300, rs502, etc:
ID SEX PV GAN GAE GT_rs1 N_rs1 E_rs1 GT_rs7 N_rs7 E_rs7 ...
2 0 7.8 0.3 0.4 0 1 1 1 0 2 ...
6 1 6.4 0.35 0.55 0 0 1 1 1 2 ...
Here is a reproducible example of my data:
dfA = data.frame(rbind(c("ID","SEX","PV","GAN","GAE","GT_rs1","N_rs1","E_rs1","GT_rs7","N_rs7","E_rs7"),
c(2,0,7.8,0.3,0.4,0,1,1,1,0,2),
c(6,1,6.4,0.35,0.55,0,0,1,1,1,2)))
dfA = dfA %>% row_to_names(row_number = 1)
Using R, I want to run a linear regression of the form:
lm(PV ~ SEX + GAN + GT_x + N_x)
where x is rs1, rs7 and so on. So, I'd need to iterate column-wise over pairs of variables. I would like to get estimate, std.error, statistic and p.value for the different covariates (SEX, GAN, GT_x and N_x). SEX = categorical variable; PV, GAN = quantitative variables; GT_x, N_x, E_x = additive variables.
Here a solution with purrr
in one simple pipeline.
You just need to create a list of GT_x
and N_x
to use. You can do it by using some regex.
library(purrr)
nn <- names(df)
pattern <- "^GT_|^N_"
vars <- nn[grepl(pattern, nn)] # get the variables that start with GT_ and N_
x <- sub(pattern, "", vars) # get every x
split(vars, x) %>%
map(paste, collapse = " + ") %>%
sprintf("PV ~ SEX + GAN + %s", .) %>%
map(lm, data = df) %>%
map_dfr(broom::tidy, .id = "model")
This returns a unique Dataframe. Each model is identified by the column model
. If you prefer a list just replace map_dfr
with map
and remove .id
.
Here I created a reproducible example of your data:
set.seed(1)
df <- data.frame(ID = 1:1000,
SEX = sample(0:1, 1000, replace = TRUE),
PV = rnorm(1000),
GAN = rnorm(1000),
GAE = rnorm(1000))
newcols <- unlist(lapply(c("GT_rs", "N_rs", "E_rs"), paste0, sample(100, 50)))
df[newcols] <- replicate(50, rnorm(1000))
df
You can build formulas by pasting together strings - we just need to know the strings you want to paste together.
This should work - it's untested because the data you share isn't shared with dput
so it's not copy/pasteable, and it only has one set of covariates so it doesn't illustrate the complexity of the problem. If you have issues, please share copy/pasteable data to illustrate and I'll try to debug.
library(stringr)
library(dplyr)
library(broom)
# get all unique strings after underscores from your column names
suffix = str_extract(names(dfA), "_.*") %>% na.omit %>% unique
prefix = c("GT", "N")
base_formula = "PV ~ SEX + GAN +"
full_formula = paste(base_formula, paste0(prefix[1], suffix), "+", paste0(prefix[2], suffix))
mods = list()
for(i in seq_along(full_formula)) {
mods[[suffix[i]]] = lm(as.formula(full_formula[i]), data = dfA)
}
stats = lapply(mods, tidy)
stats = bind_rows(stats, .id = "suffix")
Since Edo has edited its solution, I add my variant of it:
library(purrr)
library(dplyr)
library(broom)
list("GT_rs", "N_rs") %>%
map(~dfA %>%
select(matches(paste0(.x,"\\d+"))) %>%
names %>%
sub(pattern = .x, replacement = "")) %>%
reduce(intersect) %>% # until here we get the variables GT_rsx, N_rsx
sprintf("PV ~ SEX + GAN + GT_rs%s + N_rs%s", ., .) %>%
map(lm, data = dfA) %>%
map_dfr(tidy, .id = "model") %>%
group_by(model) %>%
mutate(suffix = sub("N_rs", "", term[grepl("^N_rs\\d+$", term)]))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.