After generating dummy variables?

I am trying to convert my categorical variables into dummy variables. The variables in my data are "season", "holiday", "workingday", "weather", "temp", "atemp", "humidity", "windspeed", "registered", "count", "hour", and "dow".

Here is my code:

#dummy
library(dummies)
#set up new dummy variables
data.new = data.frame(data)
data.new = cbind(data.new,dummy(data.new$season, sep = "_"))
data.new = cbind(data.new,dummy(data.new$holiday, sep = "_"))
data.new = cbind(data.new,dummy(data.new$weather, sep = "_"))
data.new = cbind(data.new,dummy(data.new$dow, sep = "_"))
data.new = cbind(data.new,dummy(data.new$hour, sep = "_"))
data.new = cbind(data.new,dummy(data.new$workingday, sep = "_"))
#delete the old variables
data.new = data.new[,-1]
data.new = data.new[,-1]
data.new = data.new[,-2]
data.new = data.new[,-8]
data.new = data.new[,-8]
data.new = data.new[,-1]

Should I delete the old variables after generating the dummy variables? If I want to do PCR, may I use all variables, e.g.

fit = pcr(count~.,data = data.new) 

to generate a linear regression model? Or should I use only the non-dummy variables?

fit = pcr(count~temp+atemp+humidity+windspeed+registered,data = data.new)

Sorry for the confusion. I originally used the lm function as an example and have now changed it to the pcr function. Thank you for reading this question!

As long as your categorical variables are factors, the lm function will handle the creation of dummy variables for you.
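
To see what lm does with a factor, model.matrix can be called on the same formula. The following is only a minimal sketch on a made-up toy data frame (the name `toy` and all of its values are assumptions for illustration, not the data from the question):

```r
# Toy data; the values are invented for illustration only.
toy <- data.frame(
  count  = c(16, 40, 32, 13, 1, 2, 36, 56),
  season = factor(c(1, 1, 2, 2, 3, 3, 4, 4)),
  temp   = c(9.8, 9.0, 14.8, 13.9, 22.1, 23.0, 12.3, 11.5)
)

# model.matrix() builds the design matrix that lm() uses internally.
# With the default treatment contrasts, a factor with k levels becomes
# k - 1 indicator columns; the first level is absorbed into the intercept.
X <- model.matrix(count ~ season + temp, data = toy)
print(colnames(X))

# lm() on the same formula estimates one coefficient per design column.
fit <- lm(count ~ season + temp, data = toy)
print(names(coef(fit)))
```

Because the indicator columns are created from the formula, the original factor column stays in the data frame and nothing has to be deleted by hand.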

I would recommend you first verify that your data is a data.frame and that the categorical predictors are indeed factors.

class(data)
sapply(data, class)

Or, more simply:

str(data)
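
If that check shows a categorical column stored as integer or numeric (common when codes such as 1 to 4 are read from a file), lm would treat it as a continuous predictor, so it needs converting first. A minimal sketch of the conversion, on invented toy data; the vector `cat_cols` is a hypothetical name, and which columns belong in it is an assumption based on the variables listed in the question:

```r
# Toy data with categorical codes stored as integers (values invented).
toy <- data.frame(
  count   = c(16, 40, 32, 13, 1, 2),
  season  = c(1L, 1L, 2L, 2L, 3L, 3L),
  holiday = c(0L, 0L, 0L, 1L, 0L, 1L),
  temp    = c(9.8, 9.0, 14.8, 13.9, 22.1, 23.0)
)

# Names of the columns that should be treated as categorical (assumed).
cat_cols <- c("season", "holiday")

# Convert each listed column to a factor, leaving the others untouched.
toy[cat_cols] <- lapply(toy[cat_cols], factor)

# Re-check the classes after the conversion.
classes <- sapply(toy, class)
print(classes)
```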

Then put them in the formula of your lm call.

fit = lm(count ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + registered + hour + dow, data=data)

Or, if the columns in the formula are the only ones in your data.frame, you can use the shorthand.

fit = lm(count ~ ., data=data)
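
One caveat with the shorthand is that `.` picks up every column other than the response, including any dummy columns created by hand. The sketch below, on invented toy data (the names `toy`, `toy_dup`, and `season_2` are assumptions for illustration), compares the two formula styles and shows what happens when a redundant hand-made dummy is left in the data frame:

```r
# Toy data; the values are invented for illustration only.
toy <- data.frame(
  count  = c(16, 40, 32, 13, 1, 2, 36, 56),
  season = factor(c(1, 1, 2, 2, 3, 3, 4, 4)),
  temp   = c(9.8, 9.0, 14.8, 13.9, 22.1, 23.0, 12.3, 11.5)
)

# The explicit formula and the shorthand give the same coefficients
# when the data frame holds only the response and the predictors.
fit_explicit  <- lm(count ~ season + temp, data = toy)
fit_shorthand <- lm(count ~ ., data = toy)
same <- all.equal(coef(fit_explicit), coef(fit_shorthand))
print(same)

# Add a hand-made dummy that duplicates what the factor already encodes.
toy_dup <- cbind(toy, season_2 = as.integer(toy$season == "2"))

# The duplicate is perfectly collinear with the factor's own indicator,
# so lm() cannot estimate it and reports an NA coefficient.
fit_dup <- lm(count ~ ., data = toy_dup)
print(coef(fit_dup))
```

So with the shorthand, the data frame should contain either the factor or its dummy columns, not both.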
