
How to create dummy variables?

I have a variable that is a factor:

 $ year           : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...

I would like to create 8 dummy variables, named "2003", "2004", etc., that take the value 0 or 1 depending on the value of the variable "year". The closest I have come is:

dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )

But this has two unfortunate consequences:

  1. The dummy variables are named dt1$year2003, dt1$year2004, etc., not just "2003", "2004", etc.
  2. model.matrix appears to drop rows where year is NA, so the command above fails with a length mismatch whenever the year variable contains NA.

Of course I can get around these problems with more code, but I like my code to be as concise as possible (within reason) so if anyone can suggest better ways to make the dummy variables I would be obliged.
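To make the two issues concrete, here is a small sketch on hypothetical toy data (a four-row data frame with one missing year, not the asker's real data). It assumes R's default setting of options(na.action = "na.omit").

```r
# Hypothetical toy data with one missing year
dt1 <- data.frame(year = factor(c(NA, 2003:2005)))

# With the default na.action (na.omit), the NA row is dropped
mm <- model.matrix(~ dt1$year - 1)

nrow(dt1)      # number of rows in the data
nrow(mm)       # one fewer: the NA row is gone, so cbind(dt1, mm) fails
colnames(mm)   # names carry the "dt1$year" prefix
```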

You could use ifelse(), which won't omit NA rows (though I guess you might not count it as being "as concise as possible"):

dt1 <- data.frame(year=factor(rep(2003:2010, 10)))  # example data

dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...    

head(dt1)
#   year yr2003 yr2004 yr2005
# 1 2003      1      0      0
# 2 2004      0      1      0
# 3 2005      0      0      1
# 4 2006      0      0      0
# 5 2007      0      0      0
# 6 2008      0      0      0
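If one line per year feels repetitive, the same ifelse() idea can be wrapped in a loop over the factor levels. This is only a minimal sketch: the yr prefix follows the example above, and the NA in the example data is added here just to show that the row is kept (its dummies come out as NA).

```r
# Example data with one missing year (the NA is an added assumption)
dt1 <- data.frame(year = factor(c(NA, rep(2003:2010, 2))))

# One 0/1 column per level; rows where year is NA get NA in every dummy
for (lv in levels(dt1$year)) {
  dt1[[paste0("yr", lv)]] <- ifelse(dt1$year == lv, 1, 0)
}

head(dt1, 3)
```

Using dt1[[...]] <- rather than within() lets the column name be built from the level at run time.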

This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded; I don't see any way to override it within model.matrix, so the columns are renamed afterwards:

options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
              c("year",levels(dt1$year)))

Note that column names such as "2003" are not syntactically valid R variable names, which can cause trouble in some contexts. The result looks like this:

  year 2003 2004 2005
1 <NA>   NA   NA   NA
2 2003    1    0    0
3 2004    0    1    0
4 2005    0    0    1
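One possible way to avoid the global options() call is to build the model frame explicitly with na.action = na.pass and hand that to model.matrix. This is a sketch rather than a recommendation; it relies on model.matrix using a supplied model frame as-is instead of rebuilding it, and the names mf and mm are just illustrative.

```r
dt1 <- data.frame(year = factor(c(NA, 2003:2005)))

# Keep NA rows by setting na.action on the model frame, not globally
mf <- model.frame(~ year, data = dt1, na.action = na.pass)
mm <- model.matrix(~ year - 1, data = mf)

# Replace the hard-coded "year2003"-style names with the bare levels
colnames(mm) <- levels(dt1$year)
dt2 <- cbind(dt1, mm)

# Non-syntactic names need backticks or [[ ]] to access
dt2$`2003`
dt2[["2004"]]
```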

The caret package provides a very simple function, dummyVars, for creating dummy variables, which is especially handy when you have more than one factor variable. You do have to make sure the target variables are factors. For example, if Sales$year is numeric, convert it first: Sales$year <- as.factor(Sales$year)

Suppose we have the original dataset 'Sales' as follows:

    year    Sales       Region
1   2010    3695.543    North
2   2010    9873.037    West
3   2008    3579.458    West
4   2005    2788.857    North
5   2005    2952.183    North
6   2008    7255.337    West
7   2005    5237.081    West
8   2010    8987.096    North
9   2008    5545.343    North
10  2008    1809.446    West

Now we can create the dummy variables for both year and Region at once:

library(caret)  # also attaches its dependencies lattice and ggplot2
Salesdummy <- dummyVars(~., data = Sales, levelsOnly = TRUE)
Sdummy <- predict(Salesdummy, Sales)

The outcome will be:

   2005 2008 2010   Sales    RegionNorth    RegionWest
1   0    0    1   3695.543       1              0
2   0    0    1   9873.037       0              1
3   0    1    0   3579.458       0              1
4   1    0    0   2788.857       1              0
5   1    0    0   2952.183       1              0
6   0    1    0   7255.337       0              1
7   1    0    0   5237.081       0              1
8   0    0    1   8987.096       1              0
9   0    1    0   5545.343       1              0 
10  0    1    0   1809.446       0              1
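For anyone who wants to reproduce this, the following sketch rebuilds the Sales table shown above and makes the factor conversion explicit. Two assumptions are worth flagging: in recent versions of R, data.frame() leaves Region as character unless it is converted, and the caret step is wrapped in requireNamespace() so the script still runs when caret is not installed.

```r
# Rebuild the example data from the table above
Sales <- data.frame(
  year   = c(2010, 2010, 2008, 2005, 2005, 2008, 2005, 2010, 2008, 2008),
  Sales  = c(3695.543, 9873.037, 3579.458, 2788.857, 2952.183,
             7255.337, 5237.081, 8987.096, 5545.343, 1809.446),
  Region = c("North", "West", "West", "North", "North",
             "West", "West", "North", "North", "West")
)

# year is numeric and Region is character here, so convert both to factors
Sales$year   <- as.factor(Sales$year)
Sales$Region <- as.factor(Sales$Region)

# Only run the caret step if the package is available
if (requireNamespace("caret", quietly = TRUE)) {
  Salesdummy <- caret::dummyVars(~ ., data = Sales, levelsOnly = TRUE)
  Sdummy <- predict(Salesdummy, newdata = Sales)
  print(head(Sdummy))
}
```

The exact column names in the result depend on which columns are factors and on the levelsOnly setting, so it is worth checking colnames(Sdummy) before relying on them.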
