Say I have a data set for which I'd like to fit an lm for each combination of variables A and B, where A has two values ('a' and 'b') and B has three values (1, 2, 3). This leaves me with six possible combinations of A and B.
That said, I would like to create six (6) models. For example, the first model would be fit to the subset of the data where A = a and B = 1.
In SAS, for example, the code would be as follows (please note the by statement):
proc glm data = mydate;
by A B;
class Cat1 Cat2;
model Y = X + Cat1 + Cat2;
run;
The by statement will generate one model for each combination of A and B.
This is really just a split-apply step:
split the data into chunks
smydate <- split(mydate, list(A = mydate$A, B = mydate$B))
Each component of smydate represents the data for a particular combination of A and B. You may need to add drop = TRUE to the split call if your data doesn't have all combinations of the levels of A and B.
apply the lm() function over the components of the list smydate
lmFun <- function(dat) {
  lm(Y ~ X + Cat1 + Cat2, data = dat)
}
models <- lapply(smydate, lmFun)
Now you have a list, models, where each component contains an lm object for the particular combination of A and B.
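To make the two steps concrete without access to the real data, here is a minimal sketch on simulated data. The data frame below is a made-up stand-in for mydate (only the column names come from the question), so treat it as an illustration rather than the asker's actual setup:

```r
# Hypothetical stand-in for `mydate`; values are simulated, names follow the question
set.seed(1)
mydate <- data.frame(
  A    = rep(c("a", "b"), each = 60),
  B    = rep(rep(1:3, each = 20), times = 2),
  Cat1 = rep(c("u", "v"), times = 60),
  Cat2 = rep(c("p", "p", "q", "q"), times = 30),
  X    = rnorm(120)
)
mydate$Y <- 1 + 2 * mydate$X + rnorm(120)

# Split: one data frame per A/B combination
smydate <- split(mydate, list(A = mydate$A, B = mydate$B), drop = TRUE)

# Apply: fit the same model to every piece
models <- lapply(smydate, function(dat) lm(Y ~ X + Cat1 + Cat2, data = dat))

# Combine: one row of coefficients per combination
coefs <- t(sapply(models, coef))
print(names(models))
print(round(coefs, 2))
```

The last step is optional; it just shows that once the models live in a list, pulling something out of all of them is another lapply/sapply call.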
An example (based on the one shown by rawr in the comments) is:
models <- lapply(split(mtcars, list(mtcars$am, mtcars$gear), drop = TRUE),
function(x) {lm(mpg ~ wt + disp, data = x)})
str(models, max = 1)
models
which gives:
> str(models, max = 1)
List of 4
$ 0.3:List of 12
..- attr(*, "class")= chr "lm"
$ 0.4:List of 12
..- attr(*, "class")= chr "lm"
$ 1.4:List of 12
..- attr(*, "class")= chr "lm"
$ 1.5:List of 12
..- attr(*, "class")= chr "lm"
> models
$`0.3`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
27.994610 -2.384834 -0.007983
$`0.4`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
219.1047 -106.8075 0.9953
$`1.4`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
43.27860 -3.03114 -0.09481
$`1.5`
Call:
lm(formula = mpg ~ wt + disp, data = x)
Coefficients:
(Intercept) wt disp
41.779042 -7.230952 -0.006731
As rawr notes in the comments, you can do this in fewer steps using by(), or any one of a number of other higher-level functions in, say, the plyr package, but doing things by hand at least once illustrates the generality of the approach; you can always use the shortcuts once you are familiar with the general idea.
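For reference, a by() version of the mtcars example might look roughly like this. It is a sketch of the shortcut, not rawr's exact code, and the object names models_by and fit_1_4 are made up here:

```r
# Fit one lm per am/gear combination in a single call
models_by <- by(mtcars, list(am = mtcars$am, gear = mtcars$gear),
                function(d) lm(mpg ~ wt + disp, data = d))

# The result is indexed by the levels of am and gear
print(dim(models_by))

# Pull out the model for am = 1, gear = 4
fit_1_4 <- models_by[["1", "4"]]
print(coef(fit_1_4))

# Combinations with no rows are returned as NULL rather than dropped
print(is.null(models_by[["0", "5"]]))
```

One difference from split() with drop = TRUE is that empty combinations stay in the result as NULL entries, so downstream code has to skip them.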
More specifically, you can use lmList to fit linear models to categories, after using @bjoseph's strategy of generating an interaction variable:
mydate <- transform(mydate, ABcat=interaction(A,B,drop=TRUE))
library("lme4") ## or library("nlme")
lmList(Y~X+Cat1+Cat2|ABcat,mydate)
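Since mydate isn't available, here is the same idea sketched on mtcars. The choice of nlme over lme4 and the names mt, grp and fits are assumptions made for the illustration:

```r
library(nlme)  # lme4 also provides an lmList(); nlme is used here as an assumption

# Build a single grouping factor from am and gear, dropping empty combinations
mt <- transform(mtcars, grp = interaction(am, gear, drop = TRUE))

# One linear model per level of grp
fits <- lmList(mpg ~ wt + disp | grp, data = mt)

# coef() collects the per-group coefficients, one row per group
print(coef(fits))
```

The convenience here is that the result is a single object with its own coef() and summary() methods, instead of a plain list you have to loop over.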
Using group_by in the dplyr package will run an analysis for each subgroup combination. Using the mtcars dataset:
library(dplyr)
res <- mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .))
res$mod
This will give you the list of lm objects.
Other packages will make this more elegant. You could do this in-line with the magrittr package and go straight to the list of lm objects:
library(magrittr)
mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .)) %>%
use_series(mod)
Or use the broom package to extract model-level summary statistics (R-squared, AIC, and so on) from the lm objects:
library(broom)
mtcars %>%
group_by(am, gear) %>%
do(mod = lm(mpg ~ wt + disp, data = .)) %>%
glance(mod)
Source: local data frame [4 x 13]
Groups: am, gear
am gear r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 0 3 0.6223489 0.5594070 2.2379851 9.887679 0.00290098 3 -31.694140 71.38828 74.22048 60.102926 12
2 0 4 0.9653343 0.8960028 0.9899495 13.923469 0.18618733 3 -2.862760 13.72552 11.27070 0.980000 1
3 1 4 0.7849464 0.6989249 2.9709337 9.125006 0.02144702 3 -18.182504 44.36501 44.68277 44.132234 5
4 1 5 0.9827679 0.9655358 1.2362092 57.031169 0.01723212 3 -5.864214 19.72843 18.16618 3.056426 2
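If you want the per-group coefficients rather than the model-level statistics, something along these lines should work. It is a sketch that assumes a reasonably recent dplyr (1.0 or later, where do() is superseded and group_modify() is available) together with broom's tidy():

```r
library(dplyr)
library(broom)

# One row per coefficient per am/gear group
coefs <- mtcars %>%
  group_by(am, gear) %>%
  group_modify(~ tidy(lm(mpg ~ wt + disp, data = .x))) %>%
  ungroup()

print(coefs)
```

Each group contributes three rows (intercept, wt, disp), with the estimate, standard error, t statistic and p-value as columns next to the grouping variables.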
You could try several different things.
Let's say our data is:
x <- structure(list(A = structure(c(1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), B = structure(c(1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), x = c(1, 2, 3, 4), y = c(2, 2, 2, 2)), .Names = c("A", "B", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
x
#> A B x y
1 A A 1 2
2 A B 2 2
3 B A 3 2
4 B B 4 2
by()
This returns a list-type object. Notice that it doesn't return results in the order we might have expected: the first factor varies fastest, so the second factor is held as stable as possible while iterating. You could adjust this by using list(x$B,x$A).
by(x[c("x","y")],list(x$A,x$B),function(x){x[1]*x[2]})
[1] 2
-------------------------------------------------------------------------------------
[1] 6
-------------------------------------------------------------------------------------
[1] 4
-------------------------------------------------------------------------------------
[1] 8
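A small sketch of that ordering difference, rebuilding the same four-row data with data.frame() so it runs on its own. The function here returns a plain number (d$x * d$y) instead of a one-cell data frame, which is a simplification for the illustration:

```r
x <- data.frame(A = factor(c("A", "A", "B", "B")),
                B = factor(c("A", "B", "A", "B")),
                x = c(1, 2, 3, 4),
                y = c(2, 2, 2, 2))

# Default grouping order: the first factor (A) varies fastest
res_ab <- by(x[c("x", "y")], list(x$A, x$B), function(d) d$x * d$y)

# Swapped order: B varies fastest, which follows the row order of x
res_ba <- by(x[c("x", "y")], list(x$B, x$A), function(d) d$x * d$y)

# Flatten both results to plain vectors to compare the ordering
print(as.vector(unlist(res_ab)))
print(as.vector(unlist(res_ba)))
```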
expand.grid()
This is a simple for loop where we pre-generate the combinations of interest, subset the data in the loop, and perform the function of interest. expand.grid() can be slow with large sets of combinations, and for loops aren't necessarily fast, but you have a lot of control in the middle.
combinations = expand.grid(levels(x$A),levels(x$B))
for(i in 1:nrow(combinations)){
d = x[x$A==combinations[i,1] & x$B==combinations[i,2],c("x","y")]
print(d[1]*d[2])
}
#> x
1 2
x
3 6
x
2 4
x
4 8
If you want the fit/predictions instead of summary stats (t-tests, etc.), it's easier to fit an interaction model of Y~(A:B)*(X + Cat1 + Cat2) - 1 - X - Cat1 - Cat2; by subtracting out the main effects, R will reparameterize and place all the variance on the interactions. Here's an example:
> mtcars <- within(mtcars, {cyl = as.factor(cyl); am=as.factor(am)})
> model <- lm(mpg~(cyl:am)*(hp+wt)-1-hp-wt, mtcars)
> summary(model)
Call:
lm(formula = mpg ~ (cyl:am) * (hp + wt) - 1 - hp - wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-2.6685 -0.9071 0.0000 0.7705 4.1879
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
cyl4:am0 2.165e+01 2.252e+01 0.961 0.3517
cyl6:am0 6.340e+01 4.245e+01 1.494 0.1560
cyl8:am0 2.746e+01 5.000e+00 5.492 6.20e-05 ***
cyl4:am1 4.725e+01 5.144e+00 9.184 1.51e-07 ***
cyl6:am1 2.320e+01 3.808e+01 0.609 0.5515
cyl8:am1 1.877e+01 1.501e+01 1.251 0.2302
cyl4:am0:hp -4.635e-02 1.107e-01 -0.419 0.6815
cyl6:am0:hp 7.425e-03 1.650e-01 0.045 0.9647
cyl8:am0:hp -2.110e-02 2.531e-02 -0.834 0.4175
cyl4:am1:hp -7.288e-02 4.457e-02 -1.635 0.1228
cyl6:am1:hp -2.000e-02 4.733e-02 -0.423 0.6786
cyl8:am1:hp -1.127e-02 4.977e-02 -0.226 0.8240
cyl4:am0:wt 1.762e+00 5.341e+00 0.330 0.7460
cyl6:am0:wt -1.332e+01 1.303e+01 -1.022 0.3231
cyl8:am0:wt -2.025e+00 1.099e+00 -1.843 0.0851 .
cyl4:am1:wt -6.465e+00 2.467e+00 -2.621 0.0193 *
cyl6:am1:wt -4.926e-15 1.386e+01 0.000 1.0000
cyl8:am1:wt NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.499 on 15 degrees of freedom
Multiple R-squared: 0.9933, Adjusted R-squared: 0.9858
F-statistic: 131.4 on 17 and 15 DF, p-value: 3.045e-13
Compare with a cyl4:am1 submodel:
> summary(lm(mpg~wt+hp, mtcars, subset=cyl=='4' & am=='1'))
Call:
lm(formula = mpg ~ wt + hp, data = mtcars, subset = cyl == "4" &
am == "1")
Residuals:
Datsun 710 Fiat 128 Honda Civic Toyota Corolla Fiat X1-9 Porsche 914-2
-2.66851 4.18787 -2.61455 3.25523 -2.62538 -0.77799
Lotus Europa Volvo 142E
1.17181 0.07154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47.24552 6.57304 7.188 0.000811 ***
wt -6.46508 3.15205 -2.051 0.095512 .
hp -0.07288 0.05695 -1.280 0.256814
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.193 on 5 degrees of freedom
Multiple R-squared: 0.6378, Adjusted R-squared: 0.493
F-statistic: 4.403 on 2 and 5 DF, p-value: 0.07893
The estimates of the coefficients are exactly the same, and the standard errors are higher/more conservative here, because s is being estimated only from the subset rather than pooling across all the models. Pooling may or may not be an appropriate assumption for your use case, statistically.
It's also much easier to get predictions: predict(model, X) vs. having to split-apply-combine again.
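As a quick check of that claim, here is a sketch comparing predictions from the pooled interaction model with those from a model fitted to one cell only. It groups by vs and am instead of cyl and am (an assumption made here so that every cell has enough rows and no coefficient comes out NA), and the names cars2, pooled, cell and single are made up:

```r
cars2 <- within(mtcars, {vs <- as.factor(vs); am <- as.factor(am)})

# One pooled fit with a separate intercept and slopes for every vs/am cell
pooled <- lm(mpg ~ (vs:am) * (hp + wt) - 1 - hp - wt, data = cars2)

# The same cell fitted on its own
cell <- subset(cars2, vs == "1" & am == "1")
single <- lm(mpg ~ hp + wt, data = cell)

# Point predictions for that cell should agree between the two fits
print(all.equal(unname(predict(pooled, cell)), unname(predict(single, cell))))
```

The point predictions match because the coefficients do; what changes between the two fits is the residual variance estimate and therefore the standard errors.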