In R, when creating a model, is there an equivalent to the by statement in SAS?

Say I have a data set for which I'd like to create an lm for each combination of variables A and B, where A has two values ('a' and 'b') and B has three values (1, 2, 3). That leaves me with six possible combinations of A and B.

That said, I would like to create six (6) models. For example, the first model would be fit to the data subset where A = a and B = 1.

In SAS, for example, the code would be as follows (please note the by statement):

proc glm data = mydate;
  by A B;
  class Cat1 Cat2;
  model Y = X Cat1 Cat2;
run;

The by statement will generate one model for each combination of A and B.

This is really just a split-apply operation:

  1. split the data into chunks

     smydate <- split(mydate, list(A = mydate$A, B = mydate$B))

    Each component of smydate represents the data for a particular combination of A and B. You may need to add drop = TRUE to the split() call if your data does not contain all combinations of the levels of A and B.

  2. apply the lm() function over the components of the list smydate

     lmFun <- function(dat) {
       lm(y ~ x + cat1 + cat2, data = dat)
     }
     models <- lapply(smydate, lmFun)

Now you have a list, models, where each component contains an lm object for a particular combination of A and B.
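
Once split, everything downstream is ordinary list manipulation. For instance, a small sketch (reusing the models list from above) to pull the coefficients out of every fit:

lapply(models, coef)  ## coefficient vector for each combination
sapply(models, coef)  ## the same, collected into a single matrix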

An example (based on the one shown by rawr in the comments) is:

models <- lapply(split(mtcars, list(mtcars$am, mtcars$gear), drop = TRUE),
                 function(x) {lm(mpg ~ wt + disp, data = x)})
str(models, max = 1)
models

which gives:

> str(models, max = 1)
List of 4
 $ 0.3:List of 12
  ..- attr(*, "class")= chr "lm"
 $ 0.4:List of 12
  ..- attr(*, "class")= chr "lm"
 $ 1.4:List of 12
  ..- attr(*, "class")= chr "lm"
 $ 1.5:List of 12
  ..- attr(*, "class")= chr "lm"
> models
$`0.3`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
  27.994610    -2.384834    -0.007983  


$`0.4`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
   219.1047    -106.8075       0.9953  


$`1.4`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
   43.27860     -3.03114     -0.09481  


$`1.5`

Call:
lm(formula = mpg ~ wt + disp, data = x)

Coefficients:
(Intercept)           wt         disp  
  41.779042    -7.230952    -0.006731  

As rawr notes in the comments, you can do this in fewer steps using by(), or with any one of a number of higher-level functions in, say, the plyr package. Doing things by hand at least once, however, illustrates the generality of the approach; you can always use the shortcuts once you are familiar with the general idea.
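
For completeness, here is a sketch of the by() route rawr mentioned; it yields the same four fits in a single call (empty combinations, e.g. am = 0 with gear = 5, come back as NULL):

models2 <- by(mtcars, list(mtcars$am, mtcars$gear),
              function(d) lm(mpg ~ wt + disp, data = d))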

More specifically, you can use lmList to fit linear models to categories, after using @bjoseph's strategy of generating an interaction variable:

mydate <- transform(mydate, ABcat = interaction(A, B, drop = TRUE))
library("lme4")  ## or library("nlme")
lmList(Y ~ X + Cat1 + Cat2 | ABcat, data = mydate)
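
lmList fits come with convenient extractors; for example (a sketch, assuming the mydate and ABcat objects from the snippet above):

fits <- lmList(Y ~ X + Cat1 + Cat2 | ABcat, data = mydate)
coef(fits)  ## one row of coefficients per level of ABcat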

Using group_by in the dplyr package will run an analysis for each subgroup combination. Using the mtcars dataset:

library(dplyr)
res <- mtcars %>%
  group_by(am, gear) %>%
  do(mod = lm(mpg ~ wt + disp, data = .))

res$mod

will give you the list of lm objects.
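
Each element of res$mod is a plain lm fit, so the usual extractors apply, e.g.:

summary(res$mod[[1]])   ## full summary for the first am/gear group
lapply(res$mod, coef)   ## coefficients for every group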

Other packages will make this more elegant. You could do this in-line with the magrittr package and go straight to the list of lm objects:

library(magrittr)
mtcars %>%
  group_by(am, gear) %>%
  do(mod = lm(mpg ~ wt + disp, data = .)) %>%
  use_series(mod)

Or use the broom package to pull model-level summary statistics out of the lm objects with glance():

library(broom)
mtcars %>%
  group_by(am, gear) %>%
  do(mod = lm(mpg ~ wt + disp, data = .)) %>%
  glance(mod)

Source: local data frame [4 x 13]
Groups: am, gear

  am gear r.squared adj.r.squared     sigma statistic    p.value df     logLik      AIC      BIC  deviance df.residual
1  0    3 0.6223489     0.5594070 2.2379851  9.887679 0.00290098  3 -31.694140 71.38828 74.22048 60.102926          12
2  0    4 0.9653343     0.8960028 0.9899495 13.923469 0.18618733  3  -2.862760 13.72552 11.27070  0.980000           1
3  1    4 0.7849464     0.6989249 2.9709337  9.125006 0.02144702  3 -18.182504 44.36501 44.68277 44.132234           5
4  1    5 0.9827679     0.9655358 1.2362092 57.031169 0.01723212  3  -5.864214 19.72843 18.16618  3.056426           2
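
glance() gives one row of model-level statistics per group; if you want the coefficient estimates themselves, broom's tidy() slots into the same pipeline (assuming the same dplyr/broom vintage that supports the glance(mod) call above):

mtcars %>%
  group_by(am, gear) %>%
  do(mod = lm(mpg ~ wt + disp, data = .)) %>%
  tidy(mod)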

You could try several different things.

Let's say our data is:

x <- structure(list(A = structure(c(1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), B = structure(c(1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), x = c(1, 2, 3, 4), y = c(2, 2, 2, 2)), .Names = c("A", "B", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
x
#>   A B x y
#> 1 A A 1 2
#> 2 A B 2 2
#> 3 B A 3 2
#> 4 B B 4 2

by()

This returns a list-type object. Notice that it doesn't return results in the order we might have expected: it tries to keep the second factor as stable as possible while iterating. You could adjust this by using list(x$B, x$A), as shown after the output below.

by(x[c("x","y")],list(x$A,x$B),function(x){x[1]*x[2]})
[1] 2
------------------------------------------------------------------------------------- 
[1] 6
------------------------------------------------------------------------------------- 
[1] 4
------------------------------------------------------------------------------------- 
[1] 8
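
For instance, swapping the grouping factors makes B vary fastest instead, so the same products come out in the order 2, 4, 6, 8:

by(x[c("x","y")], list(x$B, x$A), function(x){x[1]*x[2]})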

expand.grid()

This is a simple for loop where we pre-generate the combinations of interest, subset the data inside the loop, and apply the function of interest. expand.grid() can be slow with large sets of combinations, and for loops aren't necessarily fast, but you get a lot of control in the middle.

combinations = expand.grid(levels(x$A),levels(x$B))
for(i in 1:nrow(combinations)){
  d = x[x$A==combinations[i,1] & x$B==combinations[i,2],c("x","y")]
  print(d[1]*d[2])
}
#>   x
#> 1 2
#>   x
#> 3 6
#>   x
#> 2 4
#>   x
#> 4 8
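
The same skeleton extends directly to the modelling question: store an lm fit per combination instead of printing (a sketch using mtcars, skipping combinations with too few rows to fit):

combos <- expand.grid(am = unique(mtcars$am), gear = unique(mtcars$gear))
fits <- list()
for (i in seq_len(nrow(combos))) {
  d <- subset(mtcars, am == combos$am[i] & gear == combos$gear[i])
  if (nrow(d) >= 3) {  ## need at least as many rows as coefficients
    fits[[paste(combos[i, ], collapse = ".")]] <- lm(mpg ~ wt + disp, data = d)
  }
}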

If you want the fits/predictions instead of summary stats (t-tests, etc.), it's easier to fit an interaction model Y ~ (A:B)*(X + Cat1 + Cat2) - 1 - X - Cat1 - Cat2; by subtracting out the main effects, R will reparameterize and place all the variance on the interactions. Here's an example:

> mtcars <- within(mtcars, {cyl = as.factor(cyl); am=as.factor(am)})
> model <- lm(mpg~(cyl:am)*(hp+wt)-1-hp-wt, mtcars)
> summary(model)

Call:
lm(formula = mpg ~ (cyl:am) * (hp + wt) - 1 - hp - wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6685 -0.9071  0.0000  0.7705  4.1879 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
cyl4:am0     2.165e+01  2.252e+01   0.961   0.3517    
cyl6:am0     6.340e+01  4.245e+01   1.494   0.1560    
cyl8:am0     2.746e+01  5.000e+00   5.492 6.20e-05 ***
cyl4:am1     4.725e+01  5.144e+00   9.184 1.51e-07 ***
cyl6:am1     2.320e+01  3.808e+01   0.609   0.5515    
cyl8:am1     1.877e+01  1.501e+01   1.251   0.2302    
cyl4:am0:hp -4.635e-02  1.107e-01  -0.419   0.6815    
cyl6:am0:hp  7.425e-03  1.650e-01   0.045   0.9647    
cyl8:am0:hp -2.110e-02  2.531e-02  -0.834   0.4175    
cyl4:am1:hp -7.288e-02  4.457e-02  -1.635   0.1228    
cyl6:am1:hp -2.000e-02  4.733e-02  -0.423   0.6786    
cyl8:am1:hp -1.127e-02  4.977e-02  -0.226   0.8240    
cyl4:am0:wt  1.762e+00  5.341e+00   0.330   0.7460    
cyl6:am0:wt -1.332e+01  1.303e+01  -1.022   0.3231    
cyl8:am0:wt -2.025e+00  1.099e+00  -1.843   0.0851 .  
cyl4:am1:wt -6.465e+00  2.467e+00  -2.621   0.0193 *  
cyl6:am1:wt -4.926e-15  1.386e+01   0.000   1.0000    
cyl8:am1:wt         NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.499 on 15 degrees of freedom
Multiple R-squared:  0.9933,    Adjusted R-squared:  0.9858 
F-statistic: 131.4 on 17 and 15 DF,  p-value: 3.045e-13

Compare with the corresponding cyl4:am1 submodel:

> summary(lm(mpg~wt+hp, mtcars, subset=cyl=='4' & am=='1'))

Call:
lm(formula = mpg ~ wt + hp, data = mtcars, subset = cyl == "4" & 
    am == "1")

Residuals:
    Datsun 710       Fiat 128    Honda Civic Toyota Corolla      Fiat X1-9  Porsche 914-2 
      -2.66851        4.18787       -2.61455        3.25523       -2.62538       -0.77799 
  Lotus Europa     Volvo 142E 
       1.17181        0.07154 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.24552    6.57304   7.188 0.000811 ***
wt          -6.46508    3.15205  -2.051 0.095512 .  
hp          -0.07288    0.05695  -1.280 0.256814    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.193 on 5 degrees of freedom
Multiple R-squared:  0.6378,    Adjusted R-squared:  0.493 
F-statistic: 4.403 on 2 and 5 DF,  p-value: 0.07893

The estimates of the coefficients are exactly the same, but the standard errors are higher (more conservative) here, because the residual standard error is estimated from the subset alone rather than pooled across all the models. Whether pooling is statistically appropriate is a judgment call for your use case.

It's also much easier to get predictions: predict(model, X), versus having to split-apply-combine again.
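
For example, a sketch reusing model and the factor-converted mtcars from above (expect a rank-deficiency warning from the dropped cyl8:am1:wt term):

preds <- predict(model, newdata = mtcars)  ## each row scored with its own cyl:am coefficients
head(preds)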
