gender = sample(10:100, 10000, replace = TRUE)
desks = sample(0:1, 10000, replace = TRUE)
trees = sample(0:1, 10000, replace = TRUE)
leaves = sample(0:1, 10000, replace = TRUE)
people = sample(0:1, 10000, replace = TRUE)
rebel = c(rep(0, 9999), 1)
df = data.frame(gender, desks, trees, leaves, people, rebel)
fit = lm(gender ~ ., data = df)  # named fit so the lm() function is not masked
summary(fit)
Not sure if this is purely a statistical question.
In this example, we know that rebel has a bunch of 0s and only one 1. If I create a linear model and the p-value of rebel is 0.05, is it wrong to include that variable or to say that the variable's effect is statistically significant?
Should I be removing all columns that only have one 1?
Wouldn't it be misleading if I had a bunch of dummy variables that had a bunch of 0s and they come up as significant on the linear model?
How can we tell if a variable has a 'small sample size' (a bunch of 0s) just by the linear regression summary?
Yes, this is a stats question. No regression assumption says a predictor must not be skewed, but an extremely skewed binary predictor generally causes serious regression problems, because its coefficient rests on only a handful of observations (here, exactly one). Try out the following code...
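x  <- c(1, rep(0, 9999))          # a single 1, in the first position
x2 <- c(rep(1, 6), rep(0, 9994))  # six 1s, at the start
y  <- c(rep(0, 9999), 1)          # a single 1, in the last position
cor(x, x)   # 1
cor(x2, y)  # about -0.0002
cor(x, y)   # about -0.0001, i.e. -1/(n - 1)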
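To see why a dummy with a single 1 is a problem in the regression itself, here is a minimal sketch, not a definitive diagnostic. It refits a cut-down version of the question's model and looks at the one row where rebel is 1. The seed, the name `resp` for the response, and the reduced set of predictors are my own choices for illustration. The summary output alone will not tell you how many 1s a dummy has, so the sketch counts them directly first.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 10000
resp  <- sample(10:100, n, replace = TRUE)  # response, built as in the question
desks <- sample(0:1, n, replace = TRUE)     # a roughly balanced dummy
rebel <- c(rep(0, n - 1), 1)                # a dummy with a single 1

# Count the 1s in each dummy before fitting.
ones <- colSums(data.frame(desks, rebel))
print(ones)

fit <- lm(resp ~ desks + rebel)

# The lone rebel == 1 row has leverage 1 and a residual of zero:
# the model fits that observation exactly.
lev <- unname(hatvalues(fit)[n])
res <- unname(residuals(fit)[n])
print(c(leverage = lev, residual = res))

# So the rebel coefficient is just that row's response minus what the
# rest of the model predicts for it.
b <- coef(fit)
implied <- resp[n] - (b[["(Intercept)"]] + b[["desks"]] * desks[n])
print(c(coef = b[["rebel"]], implied = implied))
```

A leverage of 1 means the estimate for rebel is decided entirely by one data point, so a small p-value there says something about that single row, not about a general effect. Checking counts with `colSums()` or `table()` before fitting is the simplest way to spot such columns.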