Suppose I have the columns SENSIBILITE and TYPE_PEAU in the data.table DataIns:
> unique(DataIns$SENSIBILITE)
[1] "Fréquente" "Occasionnelle" "Aucune"
> unique(DataIns$TYPE_PEAU)
[1] "Mixte" "Sèche" "Normale" "Grasse"
As you can see, each column has several modalities. I want to create new binary columns from them, one per modality, that indicate for each observation whether it has that modality. In other words, if I have:
> head(DataIns[,c("SENSIBILITE","TYPE_PEAU")])
SENSIBILITE TYPE_PEAU
1: Fréquente Mixte
2: Fréquente Mixte
3: Fréquente Sèche
4: Occasionnelle Mixte
5: Occasionnelle Mixte
6: Aucune Normale
I need to get this result:
> head(DataIns)
TYPE_PEAU_M TYPE_PEAU_N TYPE_PEAU_S TYPE_PEAU_G SENSIBILITE_A SENSIBILITE_O SENSIBILITE_F
1: 1 0 0 0 0 0 1
2: 1 0 0 0 0 0 1
3: 0 0 1 0 0 0 1
4: 1 0 0 0 0 1 0
5: 1 0 0 0 0 1 0
6: 0 1 0 0 1 0 0
I get the result above using this code:
DataIns<-DataIns[,.(TYPE_PEAU_M=as.factor(ifelse(TYPE_PEAU=="Mixte", 1, 0)),
TYPE_PEAU_N=as.factor(ifelse(TYPE_PEAU=="Normale", 1, 0)),
TYPE_PEAU_S=as.factor(ifelse(TYPE_PEAU=="Sèche", 1, 0)),
TYPE_PEAU_G=as.factor(ifelse(TYPE_PEAU=="Grasse", 1, 0)),
SENSIBILITE_A=as.factor(ifelse(SENSIBILITE=="Aucune", 1, 0)),
SENSIBILITE_O=as.factor(ifelse(SENSIBILITE=="Occasionnelle", 1, 0)),
SENSIBILITE_F=as.factor(ifelse(SENSIBILITE=="Fréquente", 1, 0)))]
But this approach becomes very long when there are many columns and modalities. I am looking for a faster, more automated way to get the same result using data.table operations.
Thank you for your suggestions!
As you are already using data.table, you can use two dcast calls joined together, with substr inside the dcast formula to get the desired column names:
# create a row number column first
DT[, rn := .I][]
# double dcast & join
dcast(DT, rn ~ paste0('TYPE_PEAU_', substr(TYPE_PEAU,1,1)), value.var = 'TYPE_PEAU', fun = length)[
  dcast(DT, rn ~ paste0('SENSIBILITE_', substr(SENSIBILITE,1,1)), value.var = 'SENSIBILITE', fun = length),
  on = .(rn)]
gives:
rn TYPE_PEAU_G TYPE_PEAU_M TYPE_PEAU_N TYPE_PEAU_S SENSIBILITE_A SENSIBILITE_F SENSIBILITE_O
1: 1 0 1 0 0 0 1 0
2: 2 0 1 0 0 0 1 0
3: 3 0 0 0 1 0 1 0
4: 4 0 1 0 0 0 0 1
5: 5 0 1 0 0 0 0 1
6: 6 0 0 1 0 1 0 0
7: 7 1 0 0 0 1 0 0
When you want to include other columns, you could do:
dcast(DT, other + rn ~ paste0('TYPE_PEAU_', substr(TYPE_PEAU,1,1)), value.var = 'TYPE_PEAU', fun = length)[
  dcast(DT, other + rn ~ paste0('SENSIBILITE_', substr(SENSIBILITE,1,1)), value.var = 'SENSIBILITE', fun = length),
  on = .(rn, other)]
Or another option for when you want to include all columns:
tp <- dcast(DT, rn ~ paste0('TYPE_PEAU_', substr(TYPE_PEAU,1,1)), value.var = 'TYPE_PEAU', fun = length)
sen <- dcast(DT, rn ~ paste0('SENSIBILITE_', substr(SENSIBILITE,1,1)), value.var = 'SENSIBILITE', fun = length)
DT[tp, on = .(rn)][sen, on = .(rn)]
Used data:
DT <- fread("SENSIBILITE TYPE_PEAU
Fréquente Mixte
Fréquente Mixte
Fréquente Sèche
Occasionnelle Mixte
Occasionnelle Mixte
Aucune Normale
Aucune Grasse")[, other := sample(LETTERS, 7)]
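When there are many columns and modalities, the same idea can be written as a loop instead of one dcast per column. The following is only a minimal sketch, not part of the original answer: the vector `cols` is an assumed list of the columns to encode, and the new column names reuse the "first letter of the modality" convention from above, so it would silently overwrite a column if two modalities of the same variable started with the same letter.

```r
library(data.table)

# Same sample data as above, without the random 'other' column
DT <- data.table(
  SENSIBILITE = c("Fréquente", "Fréquente", "Fréquente",
                  "Occasionnelle", "Occasionnelle", "Aucune", "Aucune"),
  TYPE_PEAU   = c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale", "Grasse")
)

# Columns to turn into indicator columns (assumption: adapt to your data)
cols <- c("TYPE_PEAU", "SENSIBILITE")

for (col in cols) {
  for (lev in unique(DT[[col]])) {
    # New column name: original name + "_" + first letter of the modality
    new_col <- paste0(col, "_", substr(lev, 1, 1))
    # Add the 0/1 indicator by reference
    set(DT, j = new_col, value = as.integer(DT[[col]] == lev))
  }
}

# Drop the original categorical columns
DT[, (cols) := NULL]
DT
```

Using the full modality in the name (`paste0(col, "_", lev)`) avoids the first-letter collision at the cost of longer column names.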
You can use dcast.data.table on SENSIBILITE and TYPE_PEAU separately, then join the results:
d1 <- dcast.data.table(dat, I ~ TYPE_PEAU, length)
setnames(d1, names(d1)[-1], paste0("TYPE_PEAU_", names(d1)[-1]))
d2 <- dcast.data.table(dat, I ~ SENSIBILITE, length)
setnames(d2, names(d2)[-1], paste0("SENSIBILITE_", names(d2)[-1]))
d1[d2, on=.(I)][, I := NULL]
data:
dat <- fread("SENSIBILITE TYPE_PEAU
Fréquente Mixte
Fréquente Mixte
Fréquente Sèche
Occasionnelle Mixte
Occasionnelle Mixte
Aucune Normale")[,
I := .I]
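The cast-rename-join steps above repeat for every column, so they can be wrapped in `lapply` and combined with `Reduce`. This is a sketch under assumptions rather than the answer's own code: `cols` is an assumed vector of column names, and the generic `dcast` is called with an explicit `value.var` instead of `dcast.data.table` with a guessed one.

```r
library(data.table)

dat <- data.table(
  SENSIBILITE = c("Fréquente", "Fréquente", "Fréquente",
                  "Occasionnelle", "Occasionnelle", "Aucune"),
  TYPE_PEAU   = c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale")
)[, I := .I]

# Columns to encode (assumption: adapt to your data)
cols <- c("TYPE_PEAU", "SENSIBILITE")

# One wide 0/1 table per column, with prefixed column names
wide <- lapply(cols, function(col) {
  d <- dcast(dat, as.formula(paste("I ~", col)),
             fun.aggregate = length, value.var = col)
  setnames(d, names(d)[-1], paste0(col, "_", names(d)[-1]))
  d
})

# Join all the wide tables on the row id, then drop the id
res <- Reduce(function(a, b) a[b, on = .(I)], wide)[, I := NULL]
res
```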
Just create dummy variables for each column and cbind them:
library(dummies)
data <- data.frame(
SENSIBILITE=c("Fréquente", "Fréquente", "Fréquente", "Occasionnelle", "Occasionnelle", "Aucune"),
TYPE_PEAU=c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale")
)
s <- dummy(data$SENSIBILITE, sep = "_")
t <- dummy(data$TYPE_PEAU, sep = "_")
cbind(s, t)
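If installing the dummies package is not an option, a similar result can be had with base R's `model.matrix`. This swaps in a different technique from the one above and is only a sketch: the `- 1` in the formula removes the intercept so that every modality gets its own column, and the `sub` calls are an assumed way of adding the `_` separator, since `model.matrix` concatenates the variable name and the level directly.

```r
data <- data.frame(
  SENSIBILITE = c("Fréquente", "Fréquente", "Fréquente",
                  "Occasionnelle", "Occasionnelle", "Aucune"),
  TYPE_PEAU   = c("Mixte", "Mixte", "Sèche", "Mixte", "Mixte", "Normale")
)

# One indicator column per modality (no intercept, so no level is dropped)
s <- model.matrix(~ SENSIBILITE - 1, data)
p <- model.matrix(~ TYPE_PEAU - 1, data)

# model.matrix names columns like "SENSIBILITEAucune"; insert the separator
colnames(s) <- sub("^SENSIBILITE", "SENSIBILITE_", colnames(s))
colnames(p) <- sub("^TYPE_PEAU", "TYPE_PEAU_", colnames(p))

res <- cbind(s, p)
res
```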