Create dummy variables from factor levels

This is the output I want:

age colorred colorgreen colorblue
   1        1          0         0
   2        0          1         0
   3        0          0         1

I can easily create this as long as the data frame contains enough rows to represent every level of the factor. I tend to use the dummies package, and this works:

library(dummies)
df <- data.frame(
    age = c(1,2,3)
    , color = c("red", "green", "blue")
)
df$color <- factor(as.character(df$color), ordered = FALSE, levels = c("red", "green", "blue"))
str(df)
df <- dummy.data.frame(df, names = c("color"))
df

However, if the data frame does not contain enough rows to cover every level (a single row, for example), I do not get the required format:

library(dummies)

df <- data.frame(
    age = 33
    , color = "red"
)
df$color <- factor(as.character(df$color), ordered = FALSE, levels = c("red", "green", "blue"))
str(df)
df <- dummy.data.frame(df, names = c("color"))
df

Is it possible to bake the transformation into some kind of model, so that it still works when the data contains only one row?

You don't really need any packages to do this. In base R you could do:

my_columns <- c("red", "green", "blue")

df <- data.frame(
    age = c(1,2,3), 
    color = c("red", "green", "blue")
)

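# USE.NAMES = FALSE stops sapply() from using the colour strings as row names
# when color is a character vector (the default in R >= 4.0)
cbind(age = df$age, `colnames<-`(as.data.frame(t(sapply(df$color, 
      function(x) as.numeric(x == my_columns), USE.NAMES = FALSE))), my_columns))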
#>   age red green blue
#> 1   1   1     0    0
#> 2   2   0     1    0
#> 3   3   0     0    1

df <- data.frame(
    age = 33, color = "red"
)

# Same call on a single row: all three columns are still produced
cbind(age = df$age, `colnames<-`(as.data.frame(t(sapply(df$color, 
      function(x) as.numeric(x == my_columns), USE.NAMES = FALSE))), my_columns))
#>   age red green blue
#> 1  33   1     0    0

EDIT

A more complete solution allowing processing of multiple columns at once could be achieved by writing a function to handle the logic:

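expand_factors <- function(df, columns)
{
  for(column in columns){
    # Character columns are converted first, so levels() has something to return
    if(is.character(df[[column]])) df[[column]] <- factor(df[[column]])
    my_columns <- levels(df[[column]])
    # One row per observation, one 0/1 column per level
    mat <- t(sapply(df[[column]], function(x) as.numeric(x == my_columns)))
    new_cols <- setNames(as.data.frame(mat), my_columns)
    # Drop the original column and append the indicator columns
    df <- cbind(df[which(names(df) != column)], new_cols)
  }
  df
}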

For example, given this data frame:

df <- data.frame(age = 1:3,
                 shoe_size = 4:6,
                 colors = c("red", "green", "blue"),
                 fruits = c("apples", "bananas", "cherries"),
                 temp   = factor(rep("cold", 3), levels = c("hot", "cold")))

df
#>   age shoe_size colors   fruits temp
#> 1   1         4    red   apples cold
#> 2   2         5  green  bananas cold
#> 3   3         6   blue cherries cold

Then I can expand all the factors I like by doing this:

expand_factors(df, c("colors", "fruits", "temp"))
#>   age shoe_size blue green red apples bananas cherries hot cold
#> 1   1         4    0     0   1      1       0        0   0    1
#> 2   2         5    0     1   0      0       1        0   0    1
#> 3   3         6    1     0   0      0       0        1   0    1

Created on 2020-08-20 by the reprex package (v0.3.0)
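Note that expand_factors() takes the levels from the data it is given, so a character column in a one-row data frame would yield a single indicator column. One way to "bake in" the transformation, as the question asks, is to record the levels once and pass them in explicitly. The sketch below is an illustration only: expand_with_levels() and level_map are hypothetical names, not part of the answer above or of any package.

```r
# Hypothetical helper: level_map is a named list, column name -> levels to expand
expand_with_levels <- function(df, level_map) {
  for (column in names(level_map)) {
    lv <- level_map[[column]]
    # Compare every value with every stored level:
    # rows are observations, columns are levels
    mat <- outer(as.character(df[[column]]), lv, `==`) * 1L
    colnames(mat) <- lv
    # Drop the original column and append the indicator columns
    df <- cbind(df[names(df) != column], mat)
  }
  df
}

# Levels recorded once, for example from the training data
level_map <- list(colors = c("red", "green", "blue"),
                  temp   = c("hot", "cold"))

new_row  <- data.frame(age = 33, colors = "red", temp = "cold")
expanded <- expand_with_levels(new_row, level_map)
expanded
```

Because the levels come from level_map rather than from the data, the same columns are produced in the same order whether the input has one row or many.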
