
In R, when creating a model, is there an equivalent to the by statement in SAS?

Say I have a data set for which I'd like to fit an lm for each combination of variables A and B, where A has two values ('a' and 'b') and B has three values (1, 2, 3). This leaves me with six possible combinations of A and B.

That said, I would like to create six (6) models. For example, the first model would be fitted to the subset of the data where A = a and B = 1.

In SAS, for example, the code would be as follows (please note the by statement):

proc glm data = mydate;
by A B;
class Cat1 Cat2;
model Y = X Cat1 Cat2;
run;

The by statement will generate one model for each combination of A and B.

This is really just a split-apply step:

  1. split the data into chunks

     smydate <- split(mydate, list(A = mydate$A, B = mydate$B))

    Each component of smydate holds the data for one particular combination of A and B. You may need to add drop = TRUE to the split call if your data doesn't contain every combination of the levels of A and B.

  2. apply the lm() function over the components of the list smydate

     lmFun <- function(dat) {
       lm(Y ~ X + Cat1 + Cat2, data = dat)
     }
     models <- lapply(smydate, lmFun)

Now you have a list, models, where each component contains an lm object for one particular combination of A and B.

An example (based on the one shown by rawr in the comments) is:

models <- lapply(split(mtcars, list(mtcars$am, mtcars$gear), drop = TRUE),
                 function(x) {lm(mpg ~ wt + disp, data = x)})
str(models, max = 1)
models

which gives:

> str(models, max = 1)
List of 4
 $ 0.3:List of 12
  ..- attr(*, "class")= chr "lm"
 $ 0.4:List of 12
  ..- attr(*, "class")= chr "lm"
 $ 1.4:List of 12
  ..- attr(*, "class")= chr "lm"
 $ 1.5:List of 12
  ..- attr(*, "class")= chr "lm"
> models
$`0.3`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
  27.994610    -2.384834    -0.007983  


$`0.4`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
   219.1047    -106.8075       0.9953  


$`1.4`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
   43.27860     -3.03114     -0.09481  


$`1.5`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
  41.779042    -7.230952    -0.006731  

As rawr notes in the comments, you can do this in fewer steps using by(), or any one of a number of other higher-level functions in, say, the plyr package, but doing things by hand at least once illustrates the generality of the approach; you can always use the shortcuts once you are familiar with the general idea.
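
For reference, a rough sketch of the by() shortcut on the same mtcars example might look like the following. The names fits and has_fit are made up for illustration. Note that by() keeps the empty combinations as NULL entries instead of dropping them.

```r
# by() splits mtcars on the two indices and applies the function to each chunk
fits <- by(mtcars, list(am = mtcars$am, gear = mtcars$gear),
           function(d) lm(mpg ~ wt + disp, data = d))

# Combinations with no rows (e.g. am = 0, gear = 5) come back as NULL
has_fit <- !vapply(fits, is.null, logical(1))
sum(has_fit)

# The first cell is am = 0, gear = 3 (the first index varies fastest)
coef(fits[[1]])
```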

More specifically, you can use lmList to fit linear models to categories, after using @bjoseph's strategy of generating an interaction variable:

mydate <- transform(mydate, ABcat=interaction(A,B,drop=TRUE))
library("lme4")  ## or library("nlme")
lmList(Y~X+Cat1+Cat2|ABcat,mydate)
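
Since mydate is not available here, a hedged sketch of the same idea on mtcars is shown below. It assumes the nlme version of lmList; the data frame name mt and the column name grp are illustrative.

```r
library(nlme)  # assumption: nlme's lmList; lme4 offers a function of the same name

# Build the combined grouping factor, dropping empty combinations
mt <- transform(mtcars, grp = interaction(am, gear, drop = TRUE))

# One lm per level of grp
fit <- lmList(mpg ~ wt + disp | grp, data = mt)

# One row of coefficients per group
coef_tab <- coef(fit)
coef_tab
```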

Using group_by in the dplyr package will run an analysis for each subgroup combination. Using the mtcars dataset:

library(dplyr)
res <- mtcars %>%
  group_by(am, gear) %>%
  do(mod = lm(mpg ~ wt + disp, data = .))

res$mod

This will give you the list of lm objects.

Other packages will make this more elegant. You could do this in-line with the magrittr package and go straight to the list of lm objects:

library(magrittr)
mtcars %>%
  group_by(am, gear) %>%
  do(mod = lm(mpg ~ wt + disp, data = .)) %>%
  use_series(mod)

Or use the broom package to extract model-level summary statistics from the lm objects:

library(broom)
mtcars %>%
  group_by(am, gear) %>%
  do(mod = lm(mpg ~ wt + disp, data = .)) %>%
  glance(mod)

Source: local data frame [4 x 13]
Groups: am, gear

  am gear r.squared adj.r.squared     sigma statistic    p.value df     logLik      AIC      BIC  deviance df.residual
1  0    3 0.6223489     0.5594070 2.2379851  9.887679 0.00290098  3 -31.694140 71.38828 74.22048 60.102926          12
2  0    4 0.9653343     0.8960028 0.9899495 13.923469 0.18618733  3  -2.862760 13.72552 11.27070  0.980000           1
3  1    4 0.7849464     0.6989249 2.9709337  9.125006 0.02144702  3 -18.182504 44.36501 44.68277 44.132234           5
4  1    5 0.9827679     0.9655358 1.2362092 57.031169 0.01723212  3  -5.864214 19.72843 18.16618  3.056426           2

You could try several different things.

Let's say our data is:

x <- structure(list(A = structure(c(1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), B = structure(c(1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), x = c(1, 2, 3, 4), y = c(2, 2, 2, 2)), .Names = c("A", "B", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
x
#>   A B x y
#> 1 A A 1 2
#> 2 A B 2 2
#> 3 B A 3 2
#> 4 B B 4 2

by()

This returns a list-type object. Notice that it doesn't return results in the order we might have expected: the first factor varies fastest, so the second factor is held as stable as possible while iterating. You could adjust this by using list(x$B,x$A).

by(x[c("x","y")],list(x$A,x$B),function(x){x[1]*x[2]})
[1] 2
------------------------------------------------------------------------------------- 
[1] 6
------------------------------------------------------------------------------------- 
[1] 4
------------------------------------------------------------------------------------- 
[1] 8
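
To make the ordering concrete, here is a small sketch comparing the two index orders. It is a simplification of the call above: the helper prod_xy (a made-up name) returns a plain number per group instead of a one-cell data frame, so the results collapse to simple vectors.

```r
x <- data.frame(A = factor(c("A", "A", "B", "B")),
                B = factor(c("A", "B", "A", "B")),
                x = c(1, 2, 3, 4),
                y = c(2, 2, 2, 2))

# Each group has exactly one row here, so this returns a single number
prod_xy <- function(d) d$x * d$y

# First index varies fastest: (A,A), (B,A), (A,B), (B,B)
by_AB <- by(x[c("x", "y")], list(x$A, x$B), prod_xy)

# Swapping the indices makes B vary fastest instead
by_BA <- by(x[c("x", "y")], list(x$B, x$A), prod_xy)

as.vector(by_AB)
as.vector(by_BA)
```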

expand.grid()

This is a simple for loop where we pre-generate the combinations of interest, subset the data inside the loop, and apply the function of interest. expand.grid() can be slow with large sets of combinations, and for loops aren't necessarily fast, but you have a lot of control in the middle.

combinations = expand.grid(levels(x$A),levels(x$B))
for(i in 1:nrow(combinations)){
  d = x[x$A==combinations[i,1] & x$B==combinations[i,2],c("x","y")]
  print(d[1]*d[2])
}
#>   x
#> 1 2
#>   x
#> 3 6
#>   x
#> 2 4
#>   x
#> 4 8
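
If the results need to be kept rather than printed, one possible variation (a sketch, not the original author's code) stores them next to the combinations. The names combos and product are illustrative.

```r
x <- data.frame(A = factor(c("A", "A", "B", "B")),
                B = factor(c("A", "B", "A", "B")),
                x = c(1, 2, 3, 4),
                y = c(2, 2, 2, 2))

# All level combinations, kept as plain character columns
combos <- expand.grid(A = levels(x$A), B = levels(x$B),
                      stringsAsFactors = FALSE)

# For each combination, subset the data and compute x * y
combos$product <- mapply(function(a, b) {
  d <- x[x$A == a & x$B == b, ]
  d$x * d$y
}, combos$A, combos$B, USE.NAMES = FALSE)

combos
```

This relies on every combination having exactly one row; with several rows per group the function would need to return a single summary value (or the result should be kept as a list).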

If you want the fit/predictions instead of summary stats (t-tests, etc.), it's easier to fit an interaction model of Y~(A:B)*(X + Cat1 + Cat2) - 1 - X - Cat1 - Cat2; by subtracting out the main effects, R will reparameterize and place all the variance on the interactions. Here's an example:

> mtcars <- within(mtcars, {cyl = as.factor(cyl); am=as.factor(am)})
> model <- lm(mpg~(cyl:am)*(hp+wt)-1-hp-wt, mtcars)
> summary(model)

Call:
lm(formula = mpg ~ (cyl:am) * (hp + wt) - 1 - hp - wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6685 -0.9071  0.0000  0.7705  4.1879 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
cyl4:am0     2.165e+01  2.252e+01   0.961   0.3517    
cyl6:am0     6.340e+01  4.245e+01   1.494   0.1560    
cyl8:am0     2.746e+01  5.000e+00   5.492 6.20e-05 ***
cyl4:am1     4.725e+01  5.144e+00   9.184 1.51e-07 ***
cyl6:am1     2.320e+01  3.808e+01   0.609   0.5515    
cyl8:am1     1.877e+01  1.501e+01   1.251   0.2302    
cyl4:am0:hp -4.635e-02  1.107e-01  -0.419   0.6815    
cyl6:am0:hp  7.425e-03  1.650e-01   0.045   0.9647    
cyl8:am0:hp -2.110e-02  2.531e-02  -0.834   0.4175    
cyl4:am1:hp -7.288e-02  4.457e-02  -1.635   0.1228    
cyl6:am1:hp -2.000e-02  4.733e-02  -0.423   0.6786    
cyl8:am1:hp -1.127e-02  4.977e-02  -0.226   0.8240    
cyl4:am0:wt  1.762e+00  5.341e+00   0.330   0.7460    
cyl6:am0:wt -1.332e+01  1.303e+01  -1.022   0.3231    
cyl8:am0:wt -2.025e+00  1.099e+00  -1.843   0.0851 .  
cyl4:am1:wt -6.465e+00  2.467e+00  -2.621   0.0193 *  
cyl6:am1:wt -4.926e-15  1.386e+01   0.000   1.0000    
cyl8:am1:wt         NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.499 on 15 degrees of freedom
Multiple R-squared:  0.9933,    Adjusted R-squared:  0.9858 
F-statistic: 131.4 on 17 and 15 DF,  p-value: 3.045e-13

Compare with a cyl4:am1 submodel:

> summary(lm(mpg~wt+hp, mtcars, subset=cyl=='4' & am=='1'))

Call:
lm(formula = mpg ~ wt + hp, data = mtcars, subset = cyl == "4" & 
    am == "1")

Residuals:
    Datsun 710       Fiat 128    Honda Civic Toyota Corolla      Fiat X1-9  Porsche 914-2 
      -2.66851        4.18787       -2.61455        3.25523       -2.62538       -0.77799 
  Lotus Europa     Volvo 142E 
       1.17181        0.07154 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.24552    6.57304   7.188 0.000811 ***
wt          -6.46508    3.15205  -2.051 0.095512 .  
hp          -0.07288    0.05695  -1.280 0.256814    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.193 on 5 degrees of freedom
Multiple R-squared:  0.6378,    Adjusted R-squared:  0.493 
F-statistic: 4.403 on 2 and 5 DF,  p-value: 0.07893

The estimates of the coefficients are exactly the same, and the standard errors are higher/more conservative here, because s is being estimated only from the subset rather than pooling across all the models. Pooling may or may not be an appropriate assumption for your use case, statistically.

It's also much easier to get predictions: predict(model, X) vs. having to split-apply-combine again.
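
As a rough illustration of the prediction point, the sketch below uses a simpler grouping than the example above (am alone, which avoids the rank deficiency seen with cyl:am) and checks that the pooled model predicts the same values as a separately fitted submodel. The names mt, pooled, sub and new_rows are illustrative.

```r
mt <- within(mtcars, am <- factor(am))

# One pooled fit with a separate intercept and slopes for each level of am
pooled <- lm(mpg ~ am * (hp + wt) - 1 - hp - wt, data = mt)

# Separate fit on the am == 1 subset, for comparison
sub <- lm(mpg ~ hp + wt, data = mt, subset = am == "1")

# Predictions for the am == 1 rows agree between the two approaches
new_rows <- mt[mt$am == "1", ]
all.equal(unname(predict(pooled, newdata = new_rows)),
          unname(predict(sub, newdata = new_rows)))
```

With the pooled model, a single predict() call covers every group, provided the new data carries the grouping factor with the same levels.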
