Output Regression statistics for each variable one at a time in R

I have a data frame that looks like this. The names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times 'D', 'E', 'F' may be present, etc.). The only consistent variable will always be Y, and I want to regress Y on each of the other variables, one at a time.

# name and number of columns varies...so need flexible process
Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
YABC <- data.frame(Y, A, B, C)

I want to loop through each variable and collect output from the regression model.

This process creates the desired output, but only for these specific columns.

model_A <- lm(Y ~ A, YABC)

ID <- 'A'
rsq <- summary(model_A)$r.squared
adj_rsq <- summary(model_A)$adj.r.squared
sig <- summary(model_A)$sigma

datA <- data.frame(ID, rsq, adj_rsq, sig)

model_B <- lm(Y ~ B, YABC)

ID <- 'B'
rsq <- summary(model_B)$r.squared
adj_rsq <- summary(model_B)$adj.r.squared
sig <- summary(model_B)$sigma

datB <- data.frame(ID, rsq, adj_rsq, sig)

model_C <- lm(Y ~ C, YABC)

ID <- 'C'
rsq <- summary(model_C)$r.squared
adj_rsq <- summary(model_C)$adj.r.squared
sig <- summary(model_C)$sigma

datC <- data.frame(ID, rsq, adj_rsq, sig)

output <- rbind(datA, datB, datC)

How can I wrap this in a loop or some other process that will account for a varying number and naming of columns? Here is my attempt. Yes, I know it's not right; it is just me conceptualizing the kind of capability I'd like.

# initialize data frame
output__ <- data.frame(ID__ = as.character(),
                     rsq__ = as.numeric(),
                     adj_rsq__ = as.numeric(),
                     sig__ = as.numeric())

# loop through A, then B, then C
for(i in A:C) {
  model_[i] <- lm(Y ~ [i], YABC)

  ID <- '[i]'
  rsq <- summary(model_[i])$r.squared
  adj_rsq <- summary(model_[i])$adj.r.squared
  sig <- summary(model_[i])$sigma
  data__temp <- (ID__, rsq__, adj_rsq__, sig__)
  data__ <- rbind(data__, data__temp)
}

Using @BigDataScientist's approach, here is the solution I went with.

# initialize data frame
data__ <- data.frame(ID__ = as.character(),
                     rsq__ = as.numeric(),
                     adj_rsq__ = as.numeric(),
                     sig__ = as.numeric())

# loop through A, then B, then C
for(char in names(YABC)[-1]){
  model <- lm(as.formula(paste("Y ~ ", char)), YABC)
  ID__ <- paste(char)
  rsq__ <- summary(model)$r.squared
  adj_rsq__ <- summary(model)$adj.r.squared
  sig__ <- summary(model)$sigma
  data__temp <- data.frame(ID__, rsq__, adj_rsq__, sig__)
  data__ <- rbind(data__, data__temp)

}
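The loop above grows `data__` with `rbind` on every iteration. A minimal sketch of the same idea that builds one row per predictor and binds them once at the end is shown here; it assumes the response column is always named `Y`, and the helper name `fit_one` is hypothetical rather than part of the original solution.

```r
# Example data from the question
Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
YABC <- data.frame(Y, A, B, C)

# Hypothetical helper: fit Y ~ <one predictor> and return a one-row data frame
fit_one <- function(var, data) {
  s <- summary(lm(as.formula(paste("Y ~", var)), data = data))
  data.frame(ID = var,
             rsq = s$r.squared,
             adj_rsq = s$adj.r.squared,
             sig = s$sigma)
}

# Every column except Y is treated as a predictor (assumption)
predictors <- setdiff(names(YABC), "Y")

# One row per predictor, stacked in a single rbind call
output <- do.call(rbind, lapply(predictors, fit_one, data = YABC))
output
```

Because `predictors` is derived from the column names, the same code should work unchanged when columns such as D, E, or F are added or C is dropped.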

Here is a solution using the *apply family:

Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
YABC <- data.frame(Y, A, B, C)

names <- colnames(YABC[-1])

formulae <- sapply(names,function(x)as.formula(paste('Y~',x)))

lapply(formulae, function(x) lm(x, data = YABC))

Of course you can also call summary on each model:

lapply(formulae, function(x) summary(lm(x, data = YABC)))

If you want to extract values from a specific model, do as follows:

results <- lapply(formulae, function(x) lm(x, data = YABC))
results$A$coefficients

This gives the coefficients from the model that uses A as the explanatory variable.
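To get from a list of fitted models to the table the question asks for, the summary statistics can be pulled out of each model in one pass. The following is a sketch only, assuming the same `YABC` data and that every non-`Y` column is a predictor; the names `predictors`, `stats`, and `output` are illustrative.

```r
# Example data from the question
Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
YABC <- data.frame(Y, A, B, C)

predictors <- colnames(YABC)[-1]

# Fit one model per predictor; naming the input names the resulting list
results <- lapply(setNames(predictors, predictors),
                  function(x) lm(as.formula(paste("Y ~", x)), data = YABC))

# sapply returns a 3 x n matrix (one column per model); transpose to n x 3
stats <- t(sapply(results, function(m) {
  s <- summary(m)
  c(rsq = s$r.squared, adj_rsq = s$adj.r.squared, sig = s$sigma)
}))

# Move the predictor names from the row names into an ID column
output <- data.frame(ID = rownames(stats), stats, row.names = NULL)
output
```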

As written in the comment, as.formula() (see ?as.formula) is one solution. You could do something like:

model = list()
for(char in names(YABC)[-1]) {
  model[[char]] <- lm(as.formula(paste("Y ~ ", char)), YABC)
}
model
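As an alternative to pasting a string and converting it with `as.formula()`, base R's `stats::reformulate()` builds the formula directly from term labels. This is a sketch of that variant, not the answer's original code; the list name `models` and the use of `setdiff()` to pick the predictors are choices made here for illustration.

```r
# Example data from the question
Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
YABC <- data.frame(Y, A, B, C)

models <- list()
for (char in setdiff(names(YABC), "Y")) {
  # reformulate("A", response = "Y") yields the formula Y ~ A
  models[[char]] <- lm(reformulate(char, response = "Y"), data = YABC)
}

# Slope of each single-predictor model, named by predictor
sapply(models, function(m) coef(m)[[2]])
```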

This is how I do this kind of modeling. The following example assumes I am varying the outcomes and the exposures for a fixed set of covariates.

I first define the outcomes and exposures I want to test (I think in terms of epidemiology, but the idea extends to other fields).

outcomes <- c("a","b","c","d")

exposures <- c("exp1","exp2","exp3")

The assumption is that each element specified in those vectors exists as a column name in your dataset (as do the covariates listed below after the "~").

final_lm_data <- data.frame() # initialize empty data frame to hold results
for (j in seq_along(exposures)) {
  for (i in seq_along(outcomes)) {
    # formula: <outcome> ~ fixed covariates + <exposure>
    mylm <- lm(formula(paste(outcomes[i], "~",
                             "continuous.cov.1 + continuous.cov.2 + factor(categorical.variable.1)",
                             "+", exposures[j])), data = mydata)

    # coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
    ctable <- as.data.frame(coef(summary(mylm)))

    mylm_data <- as.data.frame(cbind(ctable, Variable = rownames(ctable),
                                     Outcome = outcomes[i],
                                     Exposure = exposures[j],
                                     Model_N = paste(length(mylm$residuals))))
    names(mylm_data)[4] <- "Pvalue"  # renaming the "Pr(>|t|)"
    rownames(mylm_data) <- NULL # important because we are creating stacked output dataset
    final_lm_data <- rbind(final_lm_data, mylm_data)
  }
}

This will give you a final_lm_data that contains the estimates, standard errors, t statistics, and p-values for each variable in your model, and it also keeps track of the Outcome and Exposure used in each iteration (the first and last elements of your model). Lastly, it records the N actually used after records with missing values were dropped. You can modify the creation of mylm_data to capture more information from the model (such as the R-squared).

Finally, if the covariates also vary from run to run, I am not sure how to automate that part.
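One possible way to handle covariates that change between runs is to keep each covariate set as a character vector in a named list and build the formula with `reformulate()`. This is only a sketch under assumptions: it uses R's built-in `mtcars` data instead of the `mydata` above, and the names `covariate_sets`, `exposure`, and `fits` are hypothetical.

```r
# Each list element is one covariate set (assumed names from mtcars)
covariate_sets <- list(
  base = "hp",
  full = c("hp", "factor(cyl)")
)
exposure <- "wt"  # exposure of interest, added last to every model

# Fit mpg ~ <covariates> + <exposure> once per covariate set
fits <- lapply(covariate_sets, function(covs) {
  lm(reformulate(c(covs, exposure), response = "mpg"), data = mtcars)
})

# R-squared per covariate set, named by the list names
sapply(fits, function(m) summary(m)$r.squared)
```

The same list could be looped over inside the outcome and exposure loops, with the list name recorded as an extra column in the stacked output.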
