R: How should I deal with variables that only have 1 count when performing linear regression?

Question

 gender = sample(10:100, 10000, replace = TRUE)
 desks = sample(0:1, 10000, replace = TRUE)
 trees = sample(0:1, 10000, replace = TRUE)
 leaves = sample(0:1, 10000, replace = TRUE)
 people = sample(0:1, 10000, replace = TRUE)
 rebel = c(rep(0, 9999), 1)  # 9999 zeros and a single 1

 df = data.frame(gender, desks, trees, leaves, people, rebel)

 fit = lm(gender ~ ., data = df)  # named fit so it does not mask lm()
 summary(fit)

Not sure if this is purely a statistical question.

In this example, we know that rebel has a bunch of 0s and only one 1. If I create a linear model and the p-value of rebel is 0.05, is it wrong to include that variable or to say that the variable's effect is statistically significant?

Should I be removing all columns that only have one 1?

Wouldn't it be misleading if I had a bunch of dummy variables that had a bunch of 0s and they come up as significant on the linear model?

How can we tell if a variable has a 'small sample size' (a bunch of 0s) just by the linear regression summary?

Answer 1

Yes, this is a stats question. No regression assumption requires a predictor to be unskewed, but extremely skewed bivariate distributions generally cause serious problems in regression. Try out the following code:

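 x  <- c(1, rep(0, 9999))          # a single 1 followed by 9999 zeros
 x2 <- c(rep(1, 6), rep(0, 9994))  # six 1s followed by 9994 zeros
 y  <- c(rep(0, 9999), 1)          # a single 1 in the last position
 cor(x, x)   # 1
 cor(x2, y)  # about -0.0002
 cor(x, y)   # about -0.0001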
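
To see why a dummy with a single 1 is a problem in the regression itself, here is a minimal sketch based on the question's simulated data. The seed, the reduced set of columns, and the names fit and fit0 are assumptions added for illustration. It checks that the lone rebel == 1 row has leverage 1 and a residual of essentially zero, so the model fits that row exactly, and that the t value reported for rebel is the same number as the externally studentized residual of that row in the model without rebel. Put differently, a small p-value for rebel would only say that this one observation looks like an outlier.

```r
set.seed(1)  # assumption: seed added so the sketch is reproducible
n <- 10000

# Reduced version of the question's data (fewer dummy columns, same idea).
df <- data.frame(
  gender = sample(10:100, n, replace = TRUE),
  desks  = sample(0:1, n, replace = TRUE),
  trees  = sample(0:1, n, replace = TRUE),
  rebel  = c(rep(0, n - 1), 1)
)

# Count the 1s in each dummy column; a count of 1 flags a
# predictor that is supported by a single observation.
ones <- colSums(df[, c("desks", "trees", "rebel")] == 1)

fit  <- lm(gender ~ ., data = df)          # model with rebel
fit0 <- lm(gender ~ . - rebel, data = df)  # same model without rebel

# Leverage and residual of the only row where rebel == 1.
lev <- unname(hatvalues(fit)[n])
res <- unname(residuals(fit)[n])

# t value for rebel vs. the externally studentized residual of
# that row in the model that leaves rebel out.
t_rebel <- summary(fit)$coefficients["rebel", "t value"]
r_stud  <- unname(rstudent(fit0)[n])

print(ones)
print(c(leverage = lev, residual = res))
print(c(t_rebel = t_rebel, rstudent_without_rebel = r_stud))
```

Counting the 1s per column, as colSums does here, is a more direct check than reading summary(), which does not report how many observations support each coefficient.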

solution1
0 2022-08-02 16:14:56
