
R: How should I deal with variables that only have 1 count when performing linear regression?

gender = sample(10:100, 10000, replace = TRUE)
desks = sample(0:1, 10000, replace = TRUE)
trees = sample(0:1, 10000, replace = TRUE)
leaves = sample(0:1, 10000, replace = TRUE)
people = sample(0:1, 10000, replace = TRUE)
rebel = c(rep(0, 9999), 1)

df = data.frame(cbind(gender, desks, trees, leaves, people, rebel))

lm = lm(gender ~ ., data = df)

summary(lm)

Not sure if this is purely a statistical question.

In this example, we know that rebel has a bunch of 0s and only one 1. If I create a linear model and the p-value of rebel is 0.05, is it wrong to include that variable or to say that the variable's effect is statistically significant?

Should I be removing all columns that only have one 1?

Wouldn't it be misleading if I had a bunch of dummy variables that had a bunch of 0s and they come up as significant on the linear model?

How can we tell if a variable has a 'small sample size' (a bunch of 0s) just by the linear regression summary?

Yes, this is a stats question. No regression assumption requires a predictor variable to be unskewed, but you generally run into serious regression problems with extremely skewed bivariate distributions. Try out the following code:

 x  <- c(1, rep(0, 9999))
 x2 <- c(rep(1, 6), rep(0, 9994))
 y  <- c(rep(0, 9999), 1)
 cor(x, x)   # 1
 cor(x2, y)  # about -0.0002
 cor(x, y)   # about -0.0001
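
To address the question of spotting a sparse dummy, here is a minimal sketch (not part of the original answer). The summary() coefficient table does not report how many 1s a dummy has, so one option is to count them directly and to look at the leverage of the lone observation. The seed, the reduced set of columns, and the names n, ones, fit and lev are assumptions made for illustration.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
n <- 10000
df <- data.frame(
  gender = sample(10:100, n, replace = TRUE),
  desks  = sample(0:1, n, replace = TRUE),
  trees  = sample(0:1, n, replace = TRUE),
  rebel  = c(rep(0, n - 1), 1)
)

# Count the 1s in each dummy column; a count of 1 flags a single-case dummy.
ones <- colSums(df[, c("desks", "trees", "rebel")] == 1)
print(ones)

fit <- lm(gender ~ ., data = df)

# The only row with rebel == 1 is fitted exactly by its own coefficient,
# so its leverage (hat value) is 1 and its residual is numerically 0.
lev <- hatvalues(fit)
print(lev[n])
print(residuals(fit)[n])
```

A hat value of 1 means the rebel coefficient is determined entirely by that one row, so its p-value effectively asks whether that single observation is unusual given the other predictors, and it leans heavily on the normal-errors assumption.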


 