
How does R lm choose contrasts with interaction between a categorical and continuous variables?

If I run lm with a formula like Y ~ X2 + X3 + X2:X1 + X3:X1, where X1 is continuous and X2 and X3 are categorical, I get an interaction coefficient for both levels of X2 but for only one level of X3.

The pattern depends on term order: whichever categorical interaction comes first in the formula gets a coefficient for both levels, and the second gets only one.

library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

df = data.frame(Frivolousness = sample(1:100, 50, replace =T))
df %<>% mutate(
  Personality=sample(c("Bad", "Good"), 50, replace = T),
  Timing=ifelse(Frivolousness %% 2 == 0 & runif(50) > 0.2, "Early", "Late")
  )
df %<>% mutate(
  Enchantedness = 11 + 
    ifelse(Personality=="Good", 0.23, -0.052)*Frivolousness -
    1.3*ifelse(Personality=="Good", 1, 0) +
    10*rnorm(50)
  )
df %<>% mutate(
  Personality = factor(Personality, levels=c("Bad", "Good")),
  Timing = factor(Timing, levels=c("Early", "Late"))
)

lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)
#> 
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Timing:Frivolousness + 
#>     Personality:Frivolousness, data = df)
#> 
#> Coefficients:
#>                   (Intercept)                PersonalityGood  
#>                      15.64118                      -10.99518  
#>                    TimingLate      TimingEarly:Frivolousness  
#>                      -1.41757                       -0.05796  
#>      TimingLate:Frivolousness  PersonalityGood:Frivolousness  
#>                      -0.07433                        0.33410

lm(Enchantedness ~ Personality + Timing + Personality:Frivolousness+ Timing:Frivolousness , df)
#> 
#> Call:
#> lm(formula = Enchantedness ~ Personality + Timing + Personality:Frivolousness + 
#>     Timing:Frivolousness, data = df)
#> 
#> Coefficients:
#>                   (Intercept)                PersonalityGood  
#>                      15.64118                      -10.99518  
#>                    TimingLate   PersonalityBad:Frivolousness  
#>                      -1.41757                       -0.05796  
#> PersonalityGood:Frivolousness       TimingLate:Frivolousness  
#>                       0.27614                       -0.01636

Created on 2020-02-15 by the reprex package (v0.3.0)

I think the level is dropped because including it would cause perfect collinearity: the per-level slope columns of each interaction sum to Frivolousness, so both full sets cannot be estimated together. You really should also include Frivolousness as a regressor on its own. Then you will see that R reports just one level for each interaction.
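The collinearity can be checked directly on the design matrix. The following is a minimal sketch on simulated data that only reuses the column names from the question (the values and the seed are assumptions, not the asker's data). It builds the full per-level slope columns for both interactions and shows that each set sums to Frivolousness, so stacking both sets loses one rank.

```r
# Simulated stand-in for the question's data (values are illustrative only)
set.seed(1)
n <- 50
df <- data.frame(Frivolousness = sample(1:100, n, replace = TRUE))
df$Personality <- factor(sample(c("Bad", "Good"), n, replace = TRUE),
                         levels = c("Bad", "Good"))
df$Timing <- factor(sample(c("Early", "Late"), n, replace = TRUE),
                    levels = c("Early", "Late"))
df$Enchantedness <- rnorm(n)

# One slope column per factor level for each interaction
P  <- model.matrix(~ 0 + Personality:Frivolousness, df)
Tm <- model.matrix(~ 0 + Timing:Frivolousness, df)

# Each set of columns adds up to Frivolousness itself
all.equal(unname(rowSums(P)),  as.numeric(df$Frivolousness))
all.equal(unname(rowSums(Tm)), as.numeric(df$Frivolousness))

# Intercept plus both full sets: one column is redundant
X_all <- cbind(1, P, Tm)
c(rank = qr(X_all)$rank, columns = ncol(X_all))

# lm() avoids the redundancy by coding the second interaction with contrasts
fit <- lm(Enchantedness ~ Personality + Timing +
            Timing:Frivolousness + Personality:Frivolousness, df)
names(coef(fit))
```

Swapping the order of the two interaction terms in the formula should move the full set of levels to whichever interaction comes first, which matches the pattern in the question.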

You get this odd behavior because the main-effect term, Frivolousness, is missing from the formula. Compare the two fits below:

set.seed(111)
## rebuild df with the data-generating code from the question
lm(Enchantedness ~ Personality + Timing + Timing:Frivolousness + Personality:Frivolousness, df)

Coefficients:
                  (Intercept)                PersonalityGood  
                     -1.74223                        5.31189  
                   TimingLate      TimingEarly:Frivolousness  
                     12.47243                        0.19090  
     TimingLate:Frivolousness  PersonalityGood:Frivolousness  
                     -0.09496                        0.17383  

lm(Enchantedness ~ Personality + Timing + Frivolousness + Timing:Frivolousness + Personality:Frivolousness, df)

Coefficients:
                  (Intercept)                PersonalityGood  
                      -1.7422                         5.3119  
                   TimingLate                  Frivolousness  
                      12.4724                         0.1909  
     TimingLate:Frivolousness  PersonalityGood:Frivolousness  
                      -0.2859                         0.1738  

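In the model with the main effect, the interaction term TimingLate:Frivolousness is the change in the slope of Frivolousness when Timing is Late, relative to Early. In your model the baseline slope is not estimated as a separate term, so R has to estimate it as TimingEarly:Frivolousness (the reference level), and TimingLate:Frivolousness becomes the slope for Late itself rather than a difference. Hence the coefficients for TimingEarly:Frivolousness and Frivolousness are the same (0.1909), and the two Late coefficients differ by exactly that amount: -0.09496 - 0.19090 = -0.2859.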
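To make the relationship between the two parameterizations concrete, here is a small sketch on simulated data (the data, seed, and effect sizes are assumptions chosen for illustration; only the column names follow the question). It fits both formulas and checks that they describe the same fit with the slope coefficients rearranged.

```r
# Simulated stand-in for the question's data (values are illustrative only)
set.seed(42)
n <- 50
df <- data.frame(Frivolousness = sample(1:100, n, replace = TRUE))
df$Personality <- factor(sample(c("Bad", "Good"), n, replace = TRUE),
                         levels = c("Bad", "Good"))
df$Timing <- factor(sample(c("Early", "Late"), n, replace = TRUE),
                    levels = c("Early", "Late"))
df$Enchantedness <- 11 + 0.1 * df$Frivolousness + 10 * rnorm(n)

# Without and with the Frivolousness main effect
no_main <- lm(Enchantedness ~ Personality + Timing +
                Timing:Frivolousness + Personality:Frivolousness, df)
with_main <- lm(Enchantedness ~ Personality + Timing + Frivolousness +
                  Timing:Frivolousness + Personality:Frivolousness, df)

b0 <- coef(no_main)
b1 <- coef(with_main)

# Both formulas span the same column space, so the fitted values agree
all.equal(fitted(no_main), fitted(with_main))

# Baseline slope: TimingEarly:Frivolousness plays the role of Frivolousness
all.equal(unname(b0["TimingEarly:Frivolousness"]), unname(b1["Frivolousness"]))

# Late slope (no main effect) = baseline slope + change in slope (with main effect)
all.equal(unname(b0["TimingLate:Frivolousness"]),
          unname(b1["Frivolousness"] + b1["TimingLate:Frivolousness"]))
```

All three comparisons should return TRUE, which suggests the choice between the two formulas affects only how the slopes are reported, not the fitted model.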

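As you can see, the TimingLate:Frivolousness coefficients differ considerably between the two fits (-0.09496 versus -0.2859) because they measure different things. In your case I don't think it makes sense to include only the interaction terms without the main effect, because the resulting coefficients are hard to interpret.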

You can roughly check the slope within each Timing group by fitting each group separately (tidy() comes from the broom package); the model with all terms gives a reasonable estimate of these slopes:

df %>% group_by(Timing) %>% do(broom::tidy(lm(Enchantedness ~ Frivolousness, data = .)))
# A tibble: 4 x 6
# Groups:   Timing [2]
  Timing term          estimate std.error statistic p.value
  <fct>  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 Early  (Intercept)    6.13       6.29      0.975   0.341 
2 Early  Frivolousness  0.208      0.0932    2.23    0.0366
3 Late   (Intercept)   11.5        5.35      2.14    0.0419
4 Late   Frivolousness -0.00944    0.107    -0.0882  0.930 
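The same per-group check can be done in base R without broom. The sketch below uses simulated data (an assumption; the seed and effect sizes are made up) and, like the per-group fit above, ignores Personality. It also shows that a single model with Timing * Frivolousness reproduces the per-group slopes exactly, since it gives each Timing level its own intercept and slope.

```r
# Simulated stand-in for the question's data (values are illustrative only)
set.seed(7)
n <- 50
df <- data.frame(Frivolousness = sample(1:100, n, replace = TRUE))
df$Timing <- factor(sample(c("Early", "Late"), n, replace = TRUE),
                    levels = c("Early", "Late"))
df$Enchantedness <- 11 +
  ifelse(df$Timing == "Early", 0.2, 0) * df$Frivolousness +
  10 * rnorm(n)

# Slope of Frivolousness from a separate regression within each Timing group
group_slopes <- vapply(
  split(df, df$Timing),
  function(d) unname(coef(lm(Enchantedness ~ Frivolousness, data = d))["Frivolousness"]),
  numeric(1)
)

# Slopes implied by one model with main effects and the interaction
b <- coef(lm(Enchantedness ~ Timing * Frivolousness, data = df))
model_slopes <- c(
  Early = unname(b["Frivolousness"]),
  Late  = unname(b["Frivolousness"] + b["TimingLate:Frivolousness"])
)

all.equal(group_slopes, model_slopes)
```

Once Personality terms are added to the model, the match is only approximate, which is why the comparison in the answer is a rough check rather than an identity.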
