Exclude duplicated values from different rows of similar id

Question

I have a dataset where multiple values are stored in a single column as comma-separated strings. I know it's not good practice, but that is beyond my control. An example dataset is as follows:

library(data.table)
a1 <- data.table(v1 = "a", v2 = "12,13,12,12,10")
a2 <- data.table(v1 = "b", v2 = "10,10,11,12")
a3 <- data.table(v1 = "b", v2 = "10,10,13,14,12")
DT <- rbindlist(list(a1, a2, a3))

I would like to create a new column that holds, for each v1 group, only the unique values across all rows of that group (for example, across both "b" rows). I have tried this:

DT[, v5 := paste(unlist(lapply(v2, function(x) unique(unlist(strsplit(as.character(x), ",", fixed = TRUE))))), collapse = ","), by = v1]

But it only excludes duplicated values within each row. What I got is:

   v1             v2                   v5
1:  a 12,13,12,12,10             12,13,10
2:  b    10,10,11,12 10,11,12,10,13,14,12
3:  b 10,10,13,14,12 10,11,12,10,13,14,12

The values that I hope to get in column "v5" for rows "b" are 10,11,12,13,14.

I would very much appreciate guidance on solving this problem.

Answer 1

DT[DT[,toString(unique(scan(text = v2,sep = ","))),by=v1],on="v1"]
Read 5 items
Read 9 items
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14

You can pass quiet = TRUE so that scan does not print how many items it read:

DT[DT[,toString(unique(scan(text = v2,sep = ",",quiet = TRUE))),by=v1],on="v1"]
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14

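Alternatively, with strsplit, which keeps the values as character strings instead of parsing them as numbers:

DT[DT[,toString(unique(unlist(strsplit(v2,",")))),by=v1],on="v1"]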
   v1             v2                 V1
1:  a 12,13,12,12,10         12, 13, 10
2:  b    10,10,11,12 10, 11, 12, 13, 14
3:  b 10,10,13,14,12 10, 11, 12, 13, 14

Using paste and unlist instead of toString, which gives a separator without the trailing space:

DT[DT[,.(V5=paste(unique(unlist(strsplit(v2,","))),collapse=",")),by=v1],on="v1"]
   v1             v2             V5
1:  a 12,13,12,12,10       12,13,10
2:  b    10,10,11,12 10,11,12,13,14
3:  b 10,10,13,14,12 10,11,12,13,14
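
The one-liners above do two things at once: they aggregate per group, then join the result back onto the original rows. As a rough sketch of the same idea split into two steps (the names agg and res are only illustrative and do not appear in the original answer):

```r
library(data.table)

# Same example data as in the question
DT <- data.table(
  v1 = c("a", "b", "b"),
  v2 = c("12,13,12,12,10", "10,10,11,12", "10,10,13,14,12")
)

# Step 1: one row per v1. Inside each group, v2 holds every row's string,
# so unlist(strsplit(...)) pools the values of all rows before unique().
agg <- DT[, .(v5 = paste(unique(unlist(strsplit(v2, ",", fixed = TRUE))),
                         collapse = ",")),
          by = v1]

# Step 2: join the per-group result back onto the original rows by v1.
res <- DT[agg, on = "v1"]
print(res)
```

Here agg has one row per group, and the join repeats its v5 value on every matching row of DT.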

Answer 2

You are pretty close to a solution. You must combine the values of the whole group (paste with collapse) before applying unique.

You can try summarising by v1 as follows:

DT[, .(v5 = paste(unique(unlist(strsplit(paste(v2,collapse = ","),
                              split = ","))),collapse=",")), by = v1]
#    v1             v5
# 1:  a       12,13,10
# 2:  b 10,11,12,13,14
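
If the goal is a new column on the original table instead of a summary, one possible variation (not part of the answer above, only a sketch under the same example data) is to assign by reference with := while grouping by v1. The difference from the attempt in the question is that unique() is applied once to the pooled values of the group, not inside lapply to each row separately:

```r
library(data.table)

# Same example data as in the question
DT <- data.table(
  v1 = c("a", "b", "b"),
  v2 = c("12,13,12,12,10", "10,10,11,12", "10,10,13,14,12")
)

# strsplit() returns one vector per row; unlist() flattens them into a single
# vector for the group, so unique() removes duplicates across rows as well.
# The length-1 result is recycled to every row of the group.
DT[, v5 := paste(unique(unlist(strsplit(v2, ",", fixed = TRUE))),
                 collapse = ","),
   by = v1]

print(DT)
```

Values are kept in order of first appearance within each group, so group "a" yields 12,13,10 and not a sorted list.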

solution1
2 ACCEPTED 2018-04-30 22:09:20

solution2
1 2018-04-30 22:20:29
