
Find repeating groups in r data.table

I need to identify and de-duplicate groups of records in an R data.table (though I suppose the issue would be the same in any programming language), structured like the following:

[image: input data table]

Groups are identified by the values in var1 and var2, and two groups are duplicates if they have the same size and contain the same values in var2 and var3 (the values in var3 are what the larger groups identified by var1 and var2 have in common).

So in the example the 2 red groups are duplicates, but the pair (red,blue) and the pair (red,brown) are not.

My solution consists of reshaping the table to wide format

[image: transposed data table]

and then doing unique(dt[, var1 := NULL]) and reshaping back to long format (I will no longer need var1 at that point).
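This wide-format approach can be sketched with data.table's dcast/melt on a toy table (the values below are made up for illustration; var3 is sorted within groups first so that equal groups line up column-by-column):

```r
library(data.table)

# Toy table: groups value1_1 and value1_3 contain the same var3 values,
# so they are duplicates of each other.
dt <- data.table(
  var1 = rep(c("value1_1", "value1_2", "value1_3"), each = 3),
  var2 = "value2_1",
  var3 = c("a", "b", "c",  "b", "d", "e",  "a", "b", "c"))

setorder(dt, var1, var2, var3)   # make column order comparable across groups
dt[, idx := rowid(var1, var2)]   # position of each var3 value within its group

# One row per (var1, var2) group, var3 values spread across columns
wide <- dcast(dt, var1 + var2 ~ idx, value.var = "var3")

# Drop var1 and keep one representative row per identical group
wide <- unique(wide[, var1 := NULL])

# Back to long format, dropping the NA padding from unequal group sizes
long <- melt(wide, id.vars = "var2", na.rm = TRUE)[, variable := NULL]
```

On this toy input, `long` keeps the six rows of one red-style group plus the unique group, having dropped the duplicate.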

The problem is that my real table has 165,391,868 records and it's not a one-off task but a weekly one with similarly sized tables and limited time to do it.

I have tried splitting the table into chunks, appending them back, and then doing the de-duplication, but the first reshape has now been running for more than 2 hours!

Is there an alternative, faster solution? Thank you very much!

Code to create the example table:

dt <- data.table(
var1=c(
    "value1_1",
    "value1_1",
    "value1_1",
    "value1_2",
    "value1_2",
    "value1_2",
    "value1_2",
    "value1_3",
    "value1_3",
    "value1_3",
    "value1_4",
    "value1_4",
    "value1_4",
    "value1_5",
    "value1_5",
    "value1_5",
    "value1_5"),
var2=c(
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1",
    "value2_1"),
var3=c(
    "value3_1",
    "value3_2",
    "value3_3",
    "value3_2",
    "value3_4",
    "value3_5",
    "value3_6",
    "value3_1",
    "value3_2",
    "value3_3",
    "value3_1",
    "value3_2",
    "value3_4",
    "value3_1",
    "value3_2",
    "value3_3",
    "value3_5"))

Here are 2 other options:

1) Collapsing var3 into a single value for joining

lu <- dt[, paste(var3, collapse=""), .(var1, var2)]

samegrp <- lu[lu, on=.(V1)][
    var1!=i.var1 & var2==i.var2, 
    .(var1=c(var11, var12), g=.GRP),
    .(var11=pmin(var1, i.var1), var12=pmax(var1, i.var1), var2)]

dt[samegrp, on=.(var1, var2), g := g]

output:

        var1     var2     var3  g
 1: value1_1 value2_1 value3_1  1
 2: value1_1 value2_1 value3_2  1
 3: value1_1 value2_1 value3_3  1
 4: value1_2 value2_1 value3_2 NA
 5: value1_2 value2_1 value3_4 NA
 6: value1_2 value2_1 value3_5 NA
 7: value1_2 value2_1 value3_6 NA
 8: value1_3 value2_1 value3_1  1
 9: value1_3 value2_1 value3_2  1
10: value1_3 value2_1 value3_3  1
11: value1_4 value2_1 value3_1 NA
12: value1_4 value2_1 value3_2 NA
13: value1_4 value2_1 value3_4 NA
14: value1_5 value2_1 value3_1 NA
15: value1_5 value2_1 value3_2 NA
16: value1_5 value2_1 value3_3 NA
17: value1_5 value2_1 value3_5 NA
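One caveat about the collapse step in option 1: without a separator, paste(var3, collapse="") can make distinct groups collide (e.g. c("ab","c") and c("a","bc") both collapse to "abc"), and if var3 is not sorted within groups, equal groups can look different. A defensive variant, assuming "\r" never occurs in the data:

```r
library(data.table)

# Toy table: value1_1 and value1_3 are duplicate groups
# (same var3 values, in different row order).
dt <- data.table(
  var1 = rep(c("value1_1", "value1_2", "value1_3"), each = 3),
  var2 = "value2_1",
  var3 = c("a", "b", "c",  "b", "d", "e",  "c", "a", "b"))

# Sort var3 within each group and collapse with a separator that is
# assumed not to occur in var3, so signatures cannot collide.
lu <- dt[order(var3), .(sig = paste(var3, collapse = "\r")), by = .(var1, var2)]

# Groups sharing the same (var2, signature) are duplicates of each other.
dups <- lu[, if (.N > 1) .SD, by = .(var2, sig)]
```

Here `dups` lists value1_1 and value1_3 as the duplicate pair, even though their var3 rows arrive in different orders.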

2) Matching counts:

setkey(dt, var1, var2, var3)
count <- dt[, .N, .(var1, var2)]

matches <- dt[dt, on=.(var2, var3), allow.cartesian=TRUE, nomatch=0L][
    var1!=i.var1,
    .(N=.N / 2, g=.GRP),
    .(var11=pmin(i.var1, var1), var12=pmax(i.var1, var1), var2)]

matches[count, on=.(var11=var1, var2, N), nomatch=0L][
    count, on=.(var12=var1, var2, N), nomatch=0L]

output:

      var11    var12     var2 N g
1: value1_1 value1_3 value2_1 3 1

The 2nd method is more memory-intensive and hence might be slower. But actual performance really depends on the characteristics of the actual dataset, e.g. the data types of the columns, the number of unique pairs of var1 and var2, the number of unique values of var3, etc.

I think I have a solution but let me know if it doesn't work and I'll have another crack.

I have just edited this in response to your comment, by adding var2 to the id column.

First make a column for the groups based on var1 and var2

dt[,group:=paste0(var1, var2)]

Then make an id based on the var3 values and the group size:

dt[,id:=paste0(paste(sort(var3), collapse=""), var2, .N), by=group]

Then label each group with a number based on whether it is the first, second, third, etc. time you have seen a group with that id:

dt[,groupN:=as.numeric(factor(group)), by=id]

Then keep only the first time you see each group

dt[groupN==1]
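Put together, the steps above run like this on a toy table (value1_1 and value1_3 being duplicate groups); this is a sketch only, not benchmarked:

```r
library(data.table)

# Toy table with two duplicate groups (value1_1 and value1_3).
dt <- data.table(
  var1 = rep(c("value1_1", "value1_2", "value1_3"), each = 3),
  var2 = "value2_1",
  var3 = c("a", "b", "c",  "b", "d", "e",  "a", "b", "c"))

dt[, group := paste0(var1, var2)]                                # group key
dt[, id := paste0(paste(sort(var3), collapse = ""), var2, .N),   # signature:
   by = group]                                                   # values + size
dt[, groupN := as.numeric(factor(group)), by = id]               # 1st, 2nd, ... copy
result <- dt[groupN == 1]                                        # keep first copy
```

Only the first group of each duplicate set survives: value1_3 is dropped, leaving six rows.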

This works, but I have no idea of its efficiency (in all honesty, it's probably slower, but it's a different approach). I had built the multifilter function for another project and it occurred to me to use it here. multifilter splits the data frame into a list of data frames according to the unique combinations of values found in whatever columns you supply to it. We then check for duplicated var3 columns and remove them. Finally the dataset is rebound.

multifilter <- function(data, filterorder){
  data <- as.data.frame(data)  # the [, i] indexing below assumes a data.frame, not a data.table
  newdata <- list(data)
  for(i in rev(filterorder)){
    newdata <- unlist(lapply(sort(unique(data[, i])),
                             function(x) lapply(newdata, function(y) y[y[, i] == x, ])),
                      recursive = FALSE)
  }
  return(newdata[sapply(newdata, nrow) >= 1])
}


filtereddt <- multifilter(dt,c("var1","var2"))
filtereddt <- filtereddt[!duplicated(lapply(filtereddt, function(x) x[, 3]))]
filtereddt <- do.call(rbind, filtereddt)[,-1]

output:

> filtereddt
       var2     var3
4  value2_1 value3_2
5  value2_1 value3_4
6  value2_1 value3_5
7  value2_1 value3_6
8  value2_1 value3_1
9  value2_1 value3_2
10 value2_1 value3_3
11 value2_2 value3_1
12 value2_2 value3_2
13 value2_2 value3_4
14 value2_1 value3_1
15 value2_1 value3_2
16 value2_1 value3_3
17 value2_1 value3_5
