R: multiple linear regression iterating column-wise over pairs of variables

Question

I have a data frame dfA (the real one has 1000 rows and 400,000 columns). From column 6 on, the variable names are "triads" formed by x with different prefixes (GT_x, N_x, E_x), where x = rs1, rs7, rs300, rs502, etc.:

ID    SEX    PV    GAN    GAE    GT_rs1    N_rs1    E_rs1    GT_rs7    N_rs7    E_rs7    ...
2    0    7.8    0.3    0.4    0    1    1    1    0    2    ...
6    1    6.4    0.35    0.55    0    0    1    1    1    2    ...

Here is a reproducible example of my data:

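library(dplyr)    # provides %>%
library(janitor)  # provides row_to_names()

dfA = data.frame(rbind(c("ID","SEX","PV","GAN","GAE","GT_rs1","N_rs1","E_rs1","GT_rs7","N_rs7","E_rs7"), 
                   c(2,0,7.8,0.3,0.4,0,1,1,1,0,2),
                   c(6,1,6.4,0.35,0.55,0,0,1,1,1,2)))
dfA = dfA %>% row_to_names(row_number = 1)
dfA = type.convert(dfA, as.is = TRUE)  # rbind() with the header row made every column character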

Using R, I want to run a linear regression of the form:

lm(PV ~ SEX + GAN + GT_x + N_x)

where x is rs1, rs7 and so on. So I need to iterate column-wise over pairs of variables. I would like to get the estimate, std.error, statistic and p.value for each covariate (SEX, GAN, GT_x and N_x). SEX is a categorical variable; PV and GAN are quantitative variables; GT_x, N_x and E_x are additive variables.

Answer 1

Here is a solution with purrr in one simple pipeline.

You just need to build the list of GT_x and N_x pairs to use. You can do that with a regular expression.

library(purrr)

nn <- names(df) # df is the example data frame built further down
pattern <- "^GT_|^N_"

vars <- nn[grepl(pattern, nn)] # get the variables that start with GT_ and N_
x <- sub(pattern, "", vars)    # get every x

split(vars, x) %>%
 map(paste, collapse = " + ") %>% 
 sprintf("PV ~ SEX + GAN + %s", .) %>% 
 map(lm, data = df) %>% 
 map_dfr(broom::tidy, .id = "model")

This returns a single data frame. Each model is identified by the column model. If you prefer a list, just replace map_dfr with map and remove .id.
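If you want the model column to carry the rs identifiers, one option is to map over the named list that split() returns, since map() keeps the names of its input. The sketch below assumes a small made-up data frame called toy with two triads (rs1 and rs7); toy and its values are purely illustrative and not part of the original data.

```r
library(purrr)
library(broom)

# Hypothetical toy data: two GT_x / N_x pairs, rs1 and rs7
set.seed(42)
toy <- data.frame(SEX = sample(0:1, 50, replace = TRUE),
                  PV  = rnorm(50),
                  GAN = rnorm(50),
                  GT_rs1 = rnorm(50), N_rs1 = rnorm(50),
                  GT_rs7 = rnorm(50), N_rs7 = rnorm(50))

pattern <- "^(GT|N)_"
vars <- grep(pattern, names(toy), value = TRUE)  # GT_* and N_* columns
x <- sub(pattern, "", vars)                      # the shared suffix of each pair

res <- split(vars, x) %>%                        # named list: rs1, rs7
  map(~ lm(as.formula(paste("PV ~ SEX + GAN +", paste(.x, collapse = " + "))),
           data = toy)) %>%                      # names are kept by map()
  map_dfr(tidy, .id = "model")                   # model holds "rs1" / "rs7"

res
```

Each model contributes one row per term (intercept, SEX, GAN, GT_x, N_x), so filtering on term gives the rows for a single covariate across all models.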

Here I created a reproducible example of your data:

set.seed(1)
df <- data.frame(ID = 1:1000,
                 SEX = sample(0:1, 1000, replace = TRUE),
                 PV  = rnorm(1000),
                 GAN = rnorm(1000),
                 GAE = rnorm(1000))
newcols <- unlist(lapply(c("GT_rs", "N_rs", "E_rs"), paste0, sample(100, 50)))
# 150 names (3 prefixes x 50 suffixes), so draw 150 independent columns
df[newcols] <- replicate(length(newcols), rnorm(1000))

df

Answer 2

You can build formulas by pasting strings together: we just need to know which strings you want to paste.

This should work, but it's untested because the data you shared wasn't produced with dput, so it isn't copy/pasteable, and it only has one set of covariates, so it doesn't illustrate the complexity of the problem. If you have issues, please share copy/pasteable data that illustrates them and I'll try to debug.

library(stringr)
library(dplyr)
library(broom)
# get all unique strings after underscores from your column names
suffix = str_extract(names(dfA), "_.*") %>% na.omit %>% unique
prefix = c("GT", "N")
base_formula = "PV ~ SEX + GAN +"
full_formula = paste(base_formula, paste0(prefix[1], suffix), "+", paste0(prefix[2], suffix))

mods = list()
for(i in seq_along(full_formula)) {
  mods[[suffix[i]]] = lm(as.formula(full_formula[i]), data = dfA)
}

stats = lapply(mods, tidy)
stats = bind_rows(stats, .id = "suffix")
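As an alternative to pasting strings and calling as.formula, base R's reformulate() builds a formula object directly from a character vector of term labels. The sketch below is a minimal illustration under assumed names: the toy data frame and the suffix vector are made up for the example.

```r
# Hypothetical toy data with two GT_x / N_x pairs
set.seed(1)
toy <- data.frame(SEX = sample(0:1, 40, replace = TRUE),
                  PV  = rnorm(40),
                  GAN = rnorm(40),
                  GT_rs1 = rnorm(40), N_rs1 = rnorm(40),
                  GT_rs7 = rnorm(40), N_rs7 = rnorm(40))

suffix <- c("rs1", "rs7")  # assumed known here; derive from names() in practice

# One formula per suffix: PV ~ SEX + GAN + GT_<suffix> + N_<suffix>
forms <- lapply(suffix, function(s) {
  reformulate(c("SEX", "GAN", paste0(c("GT_", "N_"), s)), response = "PV")
})
names(forms) <- suffix

# Fit one model per formula; the list keeps the suffix names
mods <- lapply(forms, lm, data = toy)
names(coef(mods[["rs7"]]))
```

Because forms is a named list, the fitted models can be passed to lapply(mods, tidy) and bind_rows(..., .id = "suffix") exactly as in the code above.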

Answer 3

Since Edo has edited their solution, I'll add my variant of it:

library(purrr)
library(dplyr)
library(broom)

list("GT_rs", "N_rs") %>% 
    map(~dfA %>%  
             select(matches(paste0(.x,"\\d+"))) %>% 
             names %>% 
             sub(pattern = .x, replacement = "")) %>% 
    reduce(intersect) %>% # up to here: the numeric suffixes x shared by GT_rsx and N_rsx
    sprintf("PV ~ SEX + GAN + GT_rs%s + N_rs%s", ., .) %>%
    map(lm, data = dfA) %>%
    map_dfr(tidy, .id = "model") %>% 
    group_by(model) %>% 
    mutate(suffix = sub("N_rs", "", term[grepl("^N_rs\\d+$", term)]))
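With 400,000 columns there are on the order of 130,000 models, and keeping every lm object in a list may use a lot of memory. A possible variation, sketched below with base R only, fits one model at a time and keeps just its coefficient table. The helper fit_one and the toy data are hypothetical names introduced for this illustration.

```r
# Hypothetical toy data with two GT_rsx / N_rsx pairs
set.seed(7)
toy <- data.frame(SEX = sample(0:1, 60, replace = TRUE),
                  PV  = rnorm(60),
                  GAN = rnorm(60),
                  GT_rs1 = rnorm(60), N_rs1 = rnorm(60),
                  GT_rs7 = rnorm(60), N_rs7 = rnorm(60))

# Fit a single model and return only its coefficient table as a data frame
fit_one <- function(s, data) {
  f  <- as.formula(sprintf("PV ~ SEX + GAN + GT_%s + N_%s", s, s))
  cf <- summary(lm(f, data = data))$coefficients  # matrix: one row per term
  data.frame(suffix = s, term = rownames(cf), cf,
             row.names = NULL, check.names = FALSE)
}

# Suffixes present with both prefixes
suffixes <- intersect(sub("^GT_", "", grep("^GT_", names(toy), value = TRUE)),
                      sub("^N_",  "", grep("^N_",  names(toy), value = TRUE)))

stats <- do.call(rbind, lapply(suffixes, fit_one, data = toy))
stats
```

The columns Estimate, Std. Error, t value and Pr(>|t|) come straight from summary.lm and correspond to the estimate, std.error, statistic and p.value columns that tidy returns.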

Answer 1   2   2020-09-29 16:40:29

Answer 2   1   accepted   2020-09-29 16:35:48

Answer 3   1   2020-09-30 07:37:28