Adding new columns with 0/1 values based on the modalities of existing columns

Suppose that I have the columns SENSIBILITE and TYPE_PEAU in the data.table DataIns.

> unique(DataIns$SENSIBILITE)
[1] "Fréquente"     "Occasionnelle" "Aucune"

> unique(DataIns$TYPE_PEAU)
[1] "Mixte"   "Sèche"   "Normale" "Grasse"

As you can see, each column has several modalities. I want to create one new binary column per modality, set to 1 when an observation has that modality and 0 otherwise. In other words, if I have:

> head(DataIns[,c("SENSIBILITE","TYPE_PEAU")])
     SENSIBILITE TYPE_PEAU
1:     Fréquente     Mixte
2:     Fréquente     Mixte
3:     Fréquente     Sèche
4: Occasionnelle     Mixte
5: Occasionnelle     Mixte
6:        Aucune   Normale

I need to get this result:

> head(DataIns)
   TYPE_PEAU_M TYPE_PEAU_N TYPE_PEAU_S TYPE_PEAU_G SENSIBILITE_A SENSIBILITE_O SENSIBILITE_F
1:           1           0           0           0             0             0             1
2:           1           0           0           0             0             0             1
3:           0           0           1           0             0             0             1
4:           1           0           0           0             0             1             0
5:           1           0           0           0             0             1             0
6:           0           1           0           0             1             0             0

I get the result above using this code:

DataIns<-DataIns[,.(TYPE_PEAU_M=as.factor(ifelse(TYPE_PEAU=="Mixte", 1, 0)),
                 TYPE_PEAU_N=as.factor(ifelse(TYPE_PEAU=="Normale", 1, 0)),
                 TYPE_PEAU_S=as.factor(ifelse(TYPE_PEAU=="Sèche", 1, 0)),
                 TYPE_PEAU_G=as.factor(ifelse(TYPE_PEAU=="Grasse", 1, 0)),
                 SENSIBILITE_A=as.factor(ifelse(SENSIBILITE=="Aucune", 1, 0)),
                 SENSIBILITE_O=as.factor(ifelse(SENSIBILITE=="Occasionnelle", 1, 0)),
                 SENSIBILITE_F=as.factor(ifelse(SENSIBILITE=="Fréquente", 1, 0)))]

But this method becomes very long when I have many columns and modalities. So I am looking for a quicker, more automated way to get the same result using data.table operations.

Thank you for your suggestions!

As you are already using data.table, you can run dcast twice and join the two results, using substr inside the dcast formula to build the desired column names:

# create a row number column first
DT[, rn := .I][]

# double dcast & join
dcast(DT, rn ~ paste0('TYPE_PEAU_', substr(TYPE_PEAU, 1, 1)),
      value.var = 'TYPE_PEAU', fun.aggregate = length)[
  dcast(DT, rn ~ paste0('SENSIBILITE_', substr(SENSIBILITE, 1, 1)),
        value.var = 'SENSIBILITE', fun.aggregate = length), on = .(rn)]

gives:

   rn TYPE_PEAU_G TYPE_PEAU_M TYPE_PEAU_N TYPE_PEAU_S SENSIBILITE_A SENSIBILITE_F SENSIBILITE_O
1:  1           0           1           0           0             0             1             0
2:  2           0           1           0           0             0             1             0
3:  3           0           0           0           1             0             1             0
4:  4           0           1           0           0             0             0             1
5:  5           0           1           0           0             0             0             1
6:  6           0           0           1           0             1             0             0
7:  7           1           0           0           0             1             0             0

When you want to include other columns, you could do:

dcast(DT, other + rn ~ paste0('TYPE_PEAU_', substr(TYPE_PEAU, 1, 1)),
      value.var = 'TYPE_PEAU', fun.aggregate = length)[
  dcast(DT, other + rn ~ paste0('SENSIBILITE_', substr(SENSIBILITE, 1, 1)),
        value.var = 'SENSIBILITE', fun.aggregate = length),
  on = .(rn, other)]

Or another option for when you want to include all columns:

tp <- dcast(DT, rn ~ paste0('TYPE_PEAU_', substr(TYPE_PEAU, 1, 1)),
            value.var = 'TYPE_PEAU', fun.aggregate = length)
sen <- dcast(DT, rn ~ paste0('SENSIBILITE_', substr(SENSIBILITE, 1, 1)),
             value.var = 'SENSIBILITE', fun.aggregate = length)

DT[tp,  on = .(rn)][sen, on = .(rn)]

Used data:

DT <- fread("SENSIBILITE TYPE_PEAU
Fréquente     Mixte
Fréquente     Mixte
Fréquente     Sèche
Occasionnelle     Mixte
Occasionnelle     Mixte
Aucune   Normale
Aucune   Grasse")[, other := sample(LETTERS, 7)]
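With many columns, the double dcast can be wrapped in a loop. The following is only a sketch of that idea, not part of the original answer: the names cols, wide_list and res are made up for illustration, the data is the six-row sample from the question, and the full modality name is used as the suffix instead of its first letter so that two modalities starting with the same letter cannot collide.

```r
library(data.table)

# Sample data copied from the question (first six rows)
DT <- data.table(
  SENSIBILITE = c("Fréquente", "Fréquente", "Fréquente",
                  "Occasionnelle", "Occasionnelle", "Aucune"),
  TYPE_PEAU   = c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale")
)
DT[, rn := .I]  # row id used as the dcast key

# Hypothetical list of columns to expand; extend as needed
cols <- c("TYPE_PEAU", "SENSIBILITE")

# One wide 0/1 table per column
wide_list <- lapply(cols, function(col) {
  w <- dcast(DT, as.formula(paste("rn ~", col)),
             value.var = col, fun.aggregate = length)
  # Prefix every modality column with the source column name
  setnames(w, names(w)[-1], paste(col, names(w)[-1], sep = "_"))
  w
})

# Join all wide tables on the row id
res <- Reduce(function(a, b) a[b, on = "rn"], wide_list)
res
```

Because each rn occurs exactly once, length returns 1 for the modality present in that row and the empty cells are filled with 0.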

You can call dcast.data.table once for SENSIBILITE and once for TYPE_PEAU, then join the results.

d1 <- dcast.data.table(dat, I ~ TYPE_PEAU, length)
setnames(d1, names(d1)[-1], paste0("TYPE_PEAU_", names(d1)[-1]))

d2 <- dcast.data.table(dat, I ~ SENSIBILITE, length)
setnames(d2, names(d2)[-1], paste0("SENSIBILITE_", names(d2)[-1]))

d1[d2, on = .(I)][, I := NULL]

data:

dat <- fread("SENSIBILITE TYPE_PEAU
    Fréquente     Mixte
    Fréquente     Mixte
    Fréquente     Sèche
Occasionnelle     Mixte
Occasionnelle     Mixte
       Aucune   Normale")[,
           I := .I]
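If reshaping is not required, the same 0/1 columns can probably be added in place with a plain loop over the columns and their modalities. This is a sketch under assumptions rather than the method of any answer here: cols is a hypothetical vector of column names, the data is the six-row sample from the question, and rows with NA in a source column would get NA in the indicator columns.

```r
library(data.table)

# Sample data copied from the question (first six rows)
DT <- data.table(
  SENSIBILITE = c("Fréquente", "Fréquente", "Fréquente",
                  "Occasionnelle", "Occasionnelle", "Aucune"),
  TYPE_PEAU   = c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale")
)

# Hypothetical list of columns to expand
cols <- c("SENSIBILITE", "TYPE_PEAU")

for (col in cols) {
  for (lev in unique(DT[[col]])) {
    # as.integer() turns TRUE/FALSE into 1/0; set() adds the column by reference
    set(DT, j = paste(col, lev, sep = "_"),
        value = as.integer(DT[[col]] == lev))
  }
}
DT
```

This keeps the original columns and needs neither a row id nor a join, at the cost of one pass over the data per modality.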

Just create dummy variables for each column with the dummies package and cbind them:

 library(dummies)

 data <- data.frame(
     SENSIBILITE=c("Fréquente", "Fréquente", "Fréquente", "Occasionnelle", "Occasionnelle", "Aucune"),
     TYPE_PEAU=c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale")
 )
 s <- dummy(data$SENSIBILITE, sep = "_")
 t <- dummy(data$TYPE_PEAU, sep = "_")
 cbind(s, t)
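In case the dummies package is not installed, the same "one dummy matrix per column, then cbind" idea can be sketched with base R's model.matrix. This is an alternative technique, not what the answer above uses; the names tp and res are illustrative, and the resulting column names have no separator (for example TYPE_PEAUMixte).

```r
# Same sample data as above, stored as factors
data <- data.frame(
  SENSIBILITE = c("Fréquente", "Fréquente", "Fréquente",
                  "Occasionnelle", "Occasionnelle", "Aucune"),
  TYPE_PEAU   = c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale"),
  stringsAsFactors = TRUE
)

# "- 1" drops the intercept so every modality gets its own 0/1 column
s  <- model.matrix(~ SENSIBILITE - 1, data = data)
tp <- model.matrix(~ TYPE_PEAU - 1, data = data)

res <- cbind(s, tp)
res
```

The result is a numeric matrix; wrap it in as.data.frame() or data.table::as.data.table() if a table is needed.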
