
Argument for setting the reference group to the biggest group in linear regression

Is there, either for the lm() function or for some other function for linear regression, an argument such that the reference group can be set to always be the biggest group rather than the alphabetical/numerical default in lm()?

Since this is often done in statistics, I suspect I keep missing it when I search the documentation, or that I'm looking in the wrong places. Any help would be appreciated!

Below is what I'd like NOT to have to keep doing, even when it is wrapped in a user-defined function (UDF).

data(mtcars) # load dataset
mtcars <- mtcars[1:31, ]  # remove a row so that there is a single biggest group
mtcars$carb <- as.factor(mtcars$carb) # carb must be a factor to have a reference group
lm(mpg ~ gear+carb+disp, data = mtcars ) # carb's group 1 is the reference by default 
mtcars <- within(mtcars, carb <- relevel(carb, ref = "4")) # set carb's group 4 as the reference
lm(mpg ~ gear+carb+disp, data = mtcars ) 
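
For reference, the group sizes in the 31-row subset can be tabulated to see which group is the biggest. This is only a minimal sketch; `d` and `counts` are hypothetical names for a working copy and its frequency table.

```r
d <- datasets::mtcars[1:31, ]     # hypothetical working copy of the same subset
counts <- table(d$carb)           # number of cars per carb value
counts
names(counts)[which.max(counts)]  # label of the most frequent group
```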

It doesn't look like lm has any option for this, but you can create a wrapper function that relevels a factor according to frequency, then use that wrapper in the formula.

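big.ref <- function(x) {
  if (!is.factor(x)) x <- factor(x)                # coerce non-factors first
  counts <- sort(table(x), decreasing = TRUE)      # level frequencies, largest first
  relevel(x, ref = names(counts)[1])               # most frequent level becomes the reference
}
lm(mpg ~ gear + big.ref(carb) + disp, data = mtcars)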
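
As a quick check of the wrapper, the sketch below (assuming the same 31-row subset, stored in a hypothetical copy `d`) confirms that group 4 ends up as the first level. It also shows one side effect worth knowing: because the call sits inside the formula, the coefficient labels carry the `big.ref(carb)` prefix.

```r
big.ref <- function(x) {
  if (!is.factor(x)) x <- factor(x)
  counts <- sort(table(x), decreasing = TRUE)
  relevel(x, ref = names(counts)[1])
}

d <- datasets::mtcars[1:31, ]   # hypothetical working copy of the subset
levels(big.ref(d$carb))[1]      # first level is the one lm() uses as reference

fit <- lm(mpg ~ gear + big.ref(carb) + disp, data = d)
names(coef(fit))                # factor terms are labelled "big.ref(carb)<level>"
```

If plain `carb` labels are preferred, the releveling has to happen on the data before the model is fitted rather than inside the formula.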

I don't believe there is a built-in function to do that but it's not that difficult to write one.

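largest_ref <- function(DF, col){
    DF[[col]] <- factor(DF[[col]])                  # make sure the column is a factor
    tbl <- table(DF[[col]])                         # frequency of each level
    largest <- names(tbl)[which.max(tbl)]           # label of the most frequent level
    DF[[col]] <- relevel(DF[[col]], ref = largest)  # make it the reference
    DF
}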

Now I will reload the test dataset and modify a copy of it, then run regressions on both datasets: the one releveled by your code and the one releveled by the function above.

data(mtcars)
mtcars <- mtcars[1:31, ]
mtc <- mtcars

mtcars$carb <- as.factor(mtcars$carb) 
mtcars <- within(mtcars, carb <- relevel(carb, ref = "4")) # set carb's group 4 as the reference
fit1 <- lm(mpg ~ gear + carb + disp, data = mtcars) 

mtc <- largest_ref(mtc, "carb")
fit2 <- lm(mpg ~ gear + carb + disp, data = mtc) 

identical(coef(fit1), coef(fit2))
#[1] TRUE

As you can see, the coefficients are identical. You can also compare the full summaries (output omitted).

summary(fit1)
summary(fit2)
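
If only the reference level needs checking, reading it off the fitted model is shorter than scanning the summary. The following is a sketch that assumes the `largest_ref()` helper defined above; an `lm` fit keeps the factor levels it used in its `xlevels` component, and the first level listed is the reference.

```r
largest_ref <- function(DF, col){
    DF[[col]] <- factor(DF[[col]])
    tbl <- table(DF[[col]])
    largest <- names(tbl)[which.max(tbl)]
    DF[[col]] <- relevel(DF[[col]], ref = largest)
    DF
}

mtc <- largest_ref(datasets::mtcars[1:31, ], "carb")
fit2 <- lm(mpg ~ gear + carb + disp, data = mtc)
fit2$xlevels$carb[1]   # reference level used in the fit
```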


 