How to create dummy variables?
I have a variable that is a factor:
$ year : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...
I would like to create 8 dummy variables, named "2003", "2004", etc., that take the value 0 or 1 depending on the value of the variable "year". The nearest I could come up with is:
dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )
But this has the unfortunate consequence that model.matrix omits NA rows altogether (so the above command fails due to differing lengths when NA is present in the year variable). Of course I can get around this with more code, but I like my code to be as concise as possible (within reason), so if anyone can suggest a better way to make the dummy variables I would be obliged.
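The length mismatch described above can be reproduced in a few lines. This is only a minimal sketch on a made-up four-row dt1 (an assumption for illustration, not the asker's data), and it assumes R's default na.action setting (na.omit) is in effect:

```r
# Hypothetical example data: one NA plus three years
dt1 <- data.frame(year = factor(c(NA, 2003, 2004, 2005)))

# With the default na.action (na.omit), the NA row is dropped
mm <- model.matrix(~ year - 1, data = dt1)

nrow(dt1)  # rows in the data frame
nrow(mm)   # rows in the model matrix (one fewer)

# cbind() on a data frame refuses inputs with different row counts;
# capture the error message instead of stopping
res <- tryCatch(cbind(dt1, mm), error = function(e) conditionMessage(e))
res
```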
You could use ifelse(), which won't omit NA rows (but I guess you might not count it as being "as concise as possible"):
dt1 <- data.frame(year=factor(rep(2003:2010, 10))) # example data
dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...
head(dt1)
# year yr2003 yr2004 yr2005
# 1 2003 1 0 0
# 2 2004 0 1 0
# 3 2005 0 0 1
# 4 2006 0 0 0
# 5 2007 0 0 0
# 6 2008 0 0 0
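If repeating one line per level feels too verbose, the same ifelse() idea can probably be wrapped in a loop over the factor levels. The following is a sketch rather than part of the original answer; the example data (with an NA added) and the "yr" prefix are assumptions chosen to mirror the code above:

```r
# Example data with an NA, to show that NA rows are kept
dt1 <- data.frame(year = factor(c(2003, 2004, NA, 2005)))

# One 0/1 column per factor level; the "yr" prefix keeps the names syntactic
for (lv in levels(dt1$year)) {
  dt1[[paste0("yr", lv)]] <- ifelse(dt1$year == lv, 1, 0)
}

dt1
```

Because comparing NA with a level yields NA, the dummy columns are NA (not 0) on rows where year is missing.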
This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded; I don't see any way to override it within model.matrix, so the columns are renamed afterwards with setNames.
options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
c("year",levels(dt1$year)))
As pointed out above, you may run into trouble in some contexts with column names that are not legal R variable names.
year 2003 2004 2005
1 <NA> NA NA NA
2 2003 1 0 0
3 2004 0 1 0
4 2005 0 0 1
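On the wish to use an argument instead of a global option: one approach that appears to work is to build the model frame explicitly with model.frame(), which accepts na.action as an argument, and hand that frame to model.matrix(). This is a sketch under that assumption, not the answerer's own method; the names mf and mm are illustrative, and the default column names (year2003, ...) are left as they come out.

```r
dt1 <- data.frame(year = factor(c(NA, 2003:2005)))

# Build the model frame explicitly so na.action can be passed as an argument
mf <- model.frame(~ year, data = dt1, na.action = na.pass)

# model.matrix() uses a supplied model frame as-is, so the NA row survives
mm <- model.matrix(~ year - 1, data = mf)

# Row counts now match, so cbind() succeeds
dt2 <- cbind(dt1, mm)
dt2
```

This leaves options("na.action") untouched, so other modelling code in the same session is not affected.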
The caret package (library(caret)) provides a very simple function, dummyVars, for creating dummy variables, which is especially handy when you have more than one factor variable. But you have to make sure the target variables are factors. E.g., if your Sales$year is numeric, you have to convert it to a factor: as.factor(Sales$year).
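The conversion step can be checked with base R alone. This is a small sketch on a hypothetical stand-in for the Sales data (the values here are made up for illustration):

```r
# Hypothetical stand-in for the Sales data, with year stored as a number
Sales <- data.frame(year  = c(2010, 2008, 2005, 2008),
                    Sales = c(3695.5, 3579.5, 2788.9, 7255.3))

is.factor(Sales$year)   # FALSE: year is still numeric here

# Convert in place so that year is treated as a categorical variable
Sales$year <- as.factor(Sales$year)

is.factor(Sales$year)   # TRUE
levels(Sales$year)      # one level per distinct year, sorted
```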
Suppose we have the original dataset 'Sales' as follows:
year Sales Region
1 2010 3695.543 North
2 2010 9873.037 West
3 2008 3579.458 West
4 2005 2788.857 North
5 2005 2952.183 North
6 2008 7255.337 West
7 2005 5237.081 West
8 2010 8987.096 North
9 2008 5545.343 North
10 2008 1809.446 West
Now we can create the dummy variables for both factors (year and Region) at once:
library(lattice)
library(ggplot2)
library(caret)
Salesdummy <- dummyVars(~., data = Sales, levelsOnly = TRUE)
Sdummy <- predict(Salesdummy, Sales)
The outcome will be:
2005 2008 2010 Sales RegionNorth RegionWest
1 0 0 1 3695.543 1 0
2 0 0 1 9873.037 0 1
3 0 1 0 3579.458 0 1
4 1 0 0 2788.857 1 0
5 1 0 0 2952.183 1 0
6 0 1 0 7255.337 0 1
7 1 0 0 5237.081 0 1
8 0 0 1 8987.096 1 0
9 0 1 0 5545.343 1 0
10 0 1 0 1809.446 0 1