
How to create dummy variables?

I have a variable that is a factor:

 $ year           : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...

I would like to create 8 dummy variables, named "2003", "2004", etc., that take the value 0 or 1 depending on the value of the variable "year". The nearest I could come up with is:

dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )

But this has two unfortunate consequences:

  1. The dummy variables are named dt1$year2003, not just "2003", "2004", etc.
  2. It seems that NA rows are omitted altogether by model.matrix (so the above command fails due to different lengths when NA is present in the year variable).

Of course I can get around these problems with more code, but I like my code to be as concise as possible (within reason), so if anyone can suggest better ways to make the dummy variables I would be obliged.

You could use ifelse(), which won't omit NA rows (but I guess you might not count it as being "as concise as possible"):

dt1 <- data.frame(year=factor(rep(2003:2010, 10)))  # example data

dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...    

head(dt1)
#   year yr2003 yr2004 yr2005
# 1 2003      1      0      0
# 2 2004      0      1      0
# 3 2005      0      0      1
# 4 2006      0      0      0
# 5 2007      0      0      0
# 6 2008      0      0      0
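
The repeated ifelse() calls above can also be generated in one step. This is only a minimal sketch, assuming the same yr<level> naming. The names lv and dummies are illustrative, and the example data adds an NA to show that missing rows are kept (as NA) rather than dropped:

```r
# Example data with one missing year (illustrative, not from the question)
dt1 <- data.frame(year = factor(c(NA, 2003, 2004, 2005, 2003)))

# One 0/1 column per factor level; comparing NA with a level yields NA
lv <- levels(dt1$year)
dummies <- sapply(lv, function(l) as.integer(dt1$year == l))
colnames(dummies) <- paste0("yr", lv)

# Attach the dummy columns to the original data frame
dt1 <- cbind(dt1, dummies)
```

Dropping the paste0() call and using lv directly as the column names would give the bare "2003", "2004" names asked for in the question.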

This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded; I don't see any way to override it within model.matrix ...

options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
              c("year",levels(dt1$year)))

As pointed out above, you may run into trouble in some contexts with column names that are not legal R variable names.

  year 2003 2004 2005
1 <NA>   NA   NA   NA
2 2003    1    0    0
3 2004    0    1    0
4 2005    0    0    1
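
On the wish for an argument instead of a global option: one possible approach, sketched here rather than offered as the definitive fix, is to build the model frame explicitly with model.frame(), which accepts na.action as an argument, and pass that frame to model.matrix(). The names mf and mm are illustrative:

```r
dt1 <- data.frame(year = factor(c(NA, 2003:2005)))

# Build the model frame explicitly so na.pass applies to this call only,
# leaving options(na.action) untouched
mf <- model.frame(~ year, data = dt1, na.action = na.pass)

# model.matrix() reuses a supplied model frame instead of rebuilding it
mm <- model.matrix(~ year - 1, data = mf)

# Replace the hard-coded "year2003"-style names with the bare levels
colnames(mm) <- levels(dt1$year)
dt2 <- cbind(dt1, mm)
```

Because the resulting names are not legal R variable names, they need backticks when used with $, for example dt2$`2003`.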

library(caret) provides a very simple function (dummyVars) to create dummy variables, especially when you have more than one factor variable. But you have to make sure the target variables are factors. E.g., if your Sales$year values are numeric, you have to convert them to a factor: as.factor(Sales$year)

Suppose we have the original dataset 'Sales' as follows:

    year    Sales       Region
1   2010    3695.543    North
2   2010    9873.037    West
3   2008    3579.458    West
4   2005    2788.857    North
5   2005    2952.183    North
6   2008    7255.337    West
7   2005    5237.081    West
8   2010    8987.096    North
9   2008    5545.343    North
10  2008    1809.446    West

Now we can create the dummy variables for both factors simultaneously:

library(lattice)
library(ggplot2)
library(caret)
Salesdummy <- dummyVars(~., data = Sales, levelsOnly = TRUE)
Sdummy <- predict(Salesdummy, Sales)

The outcome will be:

   2005 2008 2010   Sales    RegionNorth    RegionWest
1   0    0    1   3695.543       1              0
2   0    0    1   9873.037       0              1
3   0    1    0   3579.458       0              1
4   1    0    0   2788.857       1              0
5   1    0    0   2952.183       1              0
6   0    1    0   7255.337       0              1
7   1    0    0   5237.081       0              1
8   0    0    1   8987.096       1              0
9   0    1    0   5545.343       1              0 
10  0    1    0   1809.446       0              1
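
For readers without caret, a comparable result can be sketched in base R. This is a different technique from dummyVars, not what caret does internally: each factor is given an identity contrast matrix through the contrasts.arg argument of model.matrix(), so every level gets its own column. The data below is a shortened, illustrative version of the Sales table, and is_fac and full are hypothetical helper names:

```r
# Shortened example data; year and Region must be factors
Sales <- data.frame(
  year   = factor(c(2010, 2010, 2008, 2005)),
  Sales  = c(3695.543, 9873.037, 3579.458, 2788.857),
  Region = factor(c("North", "West", "West", "North"))
)

# Identity ("full") contrasts for every factor column
is_fac <- sapply(Sales, is.factor)
full <- lapply(Sales[is_fac], contrasts, contrasts = FALSE)

# No intercept, one indicator column per level of each factor
Sdummy <- model.matrix(~ . - 1, data = Sales, contrasts.arg = full)
```

The column names keep the variable prefix here (for example RegionNorth), so they would still need renaming if bare level names are wanted.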
