Exclude duplicated values across different rows with the same id

I have a dataset where multiple values are stored in a single column as comma-separated strings. I know it's not good practice, but that is beyond my control. An example dataset is as follows:

library(data.table)
a1 <- data.table(v1 = "a", v2 = "12,13,12,12,10")
a2 <- data.table(v1 = "b", v2 = "10,10,11,12")
a3 <- data.table(v1 = "b", v2 = "10,10,13,14,12")
DT <- rbindlist(list(a1, a2, a3))

I would like to create a new column containing only the unique values per v1 group; for "b", that means the unique values across both rows. I have tried this:

DT[, v5 := paste(unlist(lapply(v2, function(x) unique(unlist(strsplit(as.character(x), ",", fixed = TRUE))))), collapse = ","), by = v1]

But it only excludes duplicated values within each row. What I got is:

   v1             v2                   v5
1:  a 12,13,12,12,10             12,13,10
2:  b    10,10,11,12 10,11,12,10,13,14,12
3:  b 10,10,13,14,12 10,11,12,10,13,14,12

The values that I hope to get in column "v5" for rows "b" are 10,11,12,13,14.

I would greatly appreciate any guidance on solving this problem.

DT[DT[, toString(unique(scan(text = v2, sep = ","))), by = v1], on = "v1"]
Read 5 items
Read 9 items
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14

You can include quiet = TRUE to suppress the messages reporting how many items were read:

DT[DT[, toString(unique(scan(text = v2, sep = ",", quiet = TRUE))), by = v1], on = "v1"]
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14

DT[DT[, toString(unique(unlist(strsplit(v2, ",")))), by = v1], on = "v1"]
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14
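
One difference between the scan and strsplit variants may be worth keeping in mind: scan parses the tokens as numbers by default, whereas strsplit keeps them as character strings. The sketch below illustrates this on a single string taken from the example data; the names nums and chrs are only illustrative. For purely numeric tokens like these, both routes print the same result.

```r
# scan() parses the tokens as numbers by default
nums <- scan(text = "12,13,12,12,10", sep = ",", quiet = TRUE)

# strsplit() keeps the tokens as character strings
chrs <- unlist(strsplit("12,13,12,12,10", ",", fixed = TRUE))

class(nums)  # numeric
class(chrs)  # character

# unique() keeps the first occurrence of each value in both cases
toString(unique(nums))
toString(unique(chrs))
```

If the column could ever contain non-numeric tokens, the strsplit route is probably the safer choice, since scan with its default settings would stop with an error on them.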

Using paste and unlist

DT[DT[, .(V5 = paste(unique(unlist(strsplit(v2, ","))), collapse = ",")), by = v1], on = "v1"]
   v1             v2             V5
1:  a 12,13,12,12,10       12,13,10
2:  b    10,10,11,12 10,11,12,13,14
3:  b 10,10,13,14,12 10,11,12,13,14
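
As a possible alternative to joining the summary back onto DT, the new column can be assigned by reference with :=, since a single value computed per group is recycled to every row of that group. This is a minimal sketch that rebuilds the example data from the question; the column name v5 follows the question, and fixed = TRUE is an optional choice that treats the comma as a literal string rather than a regular expression.

```r
library(data.table)

# Same example data as in the question
DT <- data.table(
  v1 = c("a", "b", "b"),
  v2 = c("12,13,12,12,10", "10,10,11,12", "10,10,13,14,12")
)

# Within each v1 group, split every row, pool the pieces, de-duplicate once,
# and paste the result back together; := recycles it across the group's rows
DT[, v5 := paste(unique(unlist(strsplit(v2, ",", fixed = TRUE))), collapse = ","),
   by = v1]

print(DT)
```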

You are pretty close to a solution. You need to combine all the values of a group (paste with collapse) before applying unique; otherwise duplicates are only removed within each row.

You can summarise by v1 like this:

DT[, .(v5 = paste(unique(unlist(strsplit(paste(v2,collapse = ","),
                              split = ","))),collapse=",")), by = v1]
#    v1             v5
# 1:  a       12,13,10
# 2:  b 10,11,12,13,14
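
To see why the order of operations matters, the two "b" rows can be processed outside data.table. This is only an illustrative sketch; rows_b, per_row and per_group are hypothetical names, not part of the original code.

```r
# The two v2 strings belonging to group "b"
rows_b <- c("10,10,11,12", "10,10,13,14,12")

# unique() applied row by row, then concatenated:
# duplicates that span rows survive, as in the original attempt
per_row <- unlist(lapply(strsplit(rows_b, ",", fixed = TRUE), unique))
paste(per_row, collapse = ",")
# "10,11,12,10,13,14,12"

# All rows of the group pooled first, then unique() applied once
per_group <- unique(unlist(strsplit(rows_b, ",", fixed = TRUE)))
paste(per_group, collapse = ",")
# "10,11,12,13,14"
```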
