How to run a for loop to run regressions by dummy variables

I have the following code:

reg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2 + d2 + d3 + d4, df)

Where all x_i are continuous variables and d_i are mutually exclusive dummy variables (d1 is present but excluded to avoid perfect multicollinearity). Rather than including the dummy variables, I want to run a separate regression for each dummy variable == 1. I wish to achieve this through a loop in the following form:

dummylist <- list("d1", "d2", "d3", "d4")
for(i in dummylist){
   if(i==1){
      ireg <- lm(Y ~ x1 + x1_sq + x2 + x2_sq + x1x2, df)
   } else {
      # Unsure what to put here
   }
}

My three(?) questions are:

  1. In the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."?
  2. In the code above, what would I put after the -else- statement?
  3. This all begs the question: is putting an if-else statement within the -for- loop the wrong approach, or is there a more appropriate loop?

Sorry if this is too much; please let me know if it is and I can cut it down or separate it into multiple questions. I could not find a similar question, probably because I am rather new to running loops in R and don't know what to look for.

  1. In the first section of the -if- function, do I just include "i" before "reg" for my code to generate results "d1reg, d2reg, etc."?

Short: No.

In R there are many data types. One of the more versatile ones is the list object, which can store any type of object. Alternatively one could create an environment to store the results in, but that is a bit overkill.

If you know roughly how many elements should be in your list, the easiest approach is to initialize it prior to your loop:

n <- 3
regList <- vector(mode = "list", length = n)
# Optional naming:
#names(regList) <- c("d1 reg", "d2 reg", "d3 reg")

In your loop you then fill in your list iteratively:

for(i in seq_along(regList)){
   regList[[i]] <- lm(...)
}
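
As a minimal sketch of the pattern above, the following fills a named list with fitted models. The data frame sim, its columns, and the row selection are simulated purely for illustration and are not taken from the question.

```r
set.seed(1)
# Hypothetical data: 30 rows, one response and one predictor
sim <- data.frame(x1 = rnorm(30))
sim$Y <- 1 + 2 * sim$x1 + rnorm(30)

# Pre-allocate a list with one slot per model and name the slots
regList <- vector(mode = "list", length = 3)
names(regList) <- c("d1reg", "d2reg", "d3reg")

for (i in seq_along(regList)) {
  # Each iteration fits the model on a different (arbitrary) third of the rows
  rows <- seq(i, nrow(sim), by = 3)
  regList[[i]] <- lm(Y ~ x1, data = sim[rows, ])
}

# Individual fits are then retrieved by name or by position
coef(regList[["d2reg"]])
```

Storing the fits in one list, rather than in separately named objects such as d1reg and d2reg, keeps them easy to loop over later, for example with lapply(regList, summary).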

  2. What would I put after the -else- statement?

It is not entirely clear what you want here. Perhaps you only want to include the separate dummy variables in the model. For this, the simplest approach is likely to save your formula and update it iteratively.

form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
for(i in seq_along(regList)){
   # paste0() combines strings; ". ~ . + d1" means "keep the formula and add the term d1".
   # Note that form is overwritten, so the dummies accumulate across iterations.
   form <- update(form, as.formula(paste0(". ~ . + d", i)))
   regList[[i]] <- lm(form, data = df)
}
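
To make the behaviour of update() concrete, here is a small sketch that only manipulates formulas, so no data is needed. The shortened right-hand side (x1 + x2) and the names base_form and single_forms are placeholders chosen for this illustration. It contrasts overwriting the formula, where the dummies accumulate, with always updating from the same base formula, where each result contains a single dummy.

```r
base_form <- Y ~ x1 + x2

# Variant 1: overwrite the formula each time, so terms accumulate
form <- base_form
for (i in 1:3) {
  form <- update(form, as.formula(paste0(". ~ . + d", i)))
}
print(form)

# Variant 2: always start from the base formula, so each result has one dummy
single_forms <- lapply(1:3, function(i) {
  update(base_form, as.formula(paste0(". ~ . + d", i)))
})
print(single_forms[[2]])
```

Which variant is appropriate depends on whether each model should contain all dummies added so far or only the current one.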

Or maybe you are actually trying to run separate regressions on the subset where d[i] == 1. This can be done with lm itself:

form <- Y ~ x1 + x1_sq + x2 + x2_sq + x1x2
d <- list(d1, d2, d3)
for(i in seq_along(regList)){
   #Using the subset argument
   regList[[i]] <- lm(form, data = df, subset = which(d[[i]] == 1))
   #Alternatively:
   #regList[[i]] <- lm(form, data = subset(df, d[[i]] == 1))
}

Disclaimer: It is not entirely clear whether d1, d2, d3 are columns of df. If they are, the list d above cannot be built from the bare names d1, d2, d3, and the loop body can index the columns by name instead:

   regList[[i]] <- lm(form, data = df, subset = df[[paste0("d", i)]] == 1)
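
For reference, a self-contained sketch of the subset approach, assuming the dummies are columns of the data frame. The data frame sim, the group sizes, and the reduced formula are all made up for this example. It uses lapply over the dummy names instead of a for loop, which returns the list of fits directly.

```r
set.seed(42)
n <- 120
# Hypothetical data with three mutually exclusive dummy columns
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
grp <- rep(1:3, each = n / 3)
sim$d1 <- as.integer(grp == 1)
sim$d2 <- as.integer(grp == 2)
sim$d3 <- as.integer(grp == 3)
sim$Y <- 1 + 0.5 * sim$x1 - sim$x2 + grp + rnorm(n)

form <- Y ~ x1 + x2
dummies <- c("d1", "d2", "d3")

# One regression per dummy, fitted only on the rows where that dummy equals 1
regList <- lapply(dummies, function(nm) {
  lm(form, data = sim[sim[[nm]] == 1, ])
})
names(regList) <- paste0(dummies, "reg")

# Number of observations used by each fit
sapply(regList, nobs)
```

Because the dummies are mutually exclusive, every row is used by exactly one of the fits.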

  3. Is putting an if-else statement within the -for- loop the wrong approach, or is there a more appropriate loop?

In this case it is not clearly the right approach, although an if-else inside a for loop is not wrong in general. Here it just doesn't serve a clear purpose. Note also that i in dummylist takes the values "d1", "d2", "d3", "d4", because the names were quoted rather than the variables themselves being placed in the list. As a result, the test i == 1 is never TRUE.
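
A short sketch of that last point: looping over the quoted names yields strings, so comparing them with 1 is always FALSE. The vector hits is a helper introduced only for this illustration.

```r
dummylist <- list("d1", "d2", "d3", "d4")

# Record the result of the comparison used in the question's if() test
hits <- logical(0)
for (i in dummylist) {
  # i is a character string such as "d1", so this compares "d1" with "1"
  hits <- c(hits, i == 1)
}
hits

# The strings are still useful as column names, e.g. df[[i]] inside the loop
```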

Another thing to address, however, is whether you have transformed the variables yourself before performing your linear regression. R's formula interface lets you do this directly in the formula, and doing so helps you avoid mistakes such as testing a variable for which an interaction or higher-order term exists, unless that is really what you want to do. For example, I assume x1_sq = x1^2. Maybe d1, d2, d3 are all contained in a single variable d? In these cases you should use the original variables, as shown below:

lm(formula = Y ~ poly(x1, 2, raw = TRUE) + poly(x2, 2, raw = TRUE) + x1:x2, data = df ) #+d if d1, d2, d3 is part of the formula

Here poly(x1, 2, raw = TRUE) is the second-order polynomial in x1, and raw = TRUE returns the terms as x1 + I(x1^2) rather than in the orthogonal representation.
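
The equivalence can be checked on simulated data. The data frame sim and the coefficients used to generate Y below are invented for this sketch.

```r
set.seed(7)
# Hypothetical data with a quadratic relationship
sim <- data.frame(x1 = rnorm(50))
sim$Y <- 1 + sim$x1 - 0.5 * sim$x1^2 + rnorm(50)

fit_raw  <- lm(Y ~ poly(x1, 2, raw = TRUE), data = sim)  # raw polynomial
fit_I    <- lm(Y ~ x1 + I(x1^2), data = sim)             # terms written by hand
fit_orth <- lm(Y ~ poly(x1, 2), data = sim)              # orthogonal polynomial

# Raw polynomial and hand-written terms give the same coefficients
all.equal(unname(coef(fit_raw)), unname(coef(fit_I)))

# The orthogonal version has different coefficients but the same fitted values
all.equal(fitted(fit_raw), fitted(fit_orth))
```

The choice between raw and orthogonal polynomials therefore affects how the coefficients are read, not the fit itself.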

If one does this, the output of drop1, anova, etc. will take into account that first-order terms should not be tested while the second-order terms or interactions involving them remain in the model.
