R data.table: what is the fastest way to intersect a data.table by multiple columns by keys and groups

MAJOR EDIT to clarify as answers are wrong

I have a data.table with group columns (split_by), key columns (key_by) and trait id columns (intersect_by).

Within each group defined by split_by, I want to keep only the rows whose trait ids are shared by all keys present in the group.

For example:

dt <- data.table(id = 1:6, key1 = 1, key2 = c(1:2, 2), group_id1= 1, group_id2= c(1:2, 2:1, 1:2), trait_id1 = 1, trait_id2 = 2:1)
setkey(dt, group_id1, group_id2, trait_id1, trait_id2)
dt
   id key1 key2 group_id1 group_id2 trait_id1 trait_id2
1:  4    1    1         1         1         1         1
2:  1    1    1         1         1         1         2
3:  5    1    2         1         1         1         2
4:  2    1    2         1         2         1         1
5:  6    1    2         1         2         1         1
6:  3    1    2         1         2         1         2

res <- intersect_this_by(dt,
                         key_by = c("key1", "key2"),
                         split_by = c("group_id1", "group_id2"),
                         intersect_by = c("trait_id1", "trait_id2"))

I want res to look like this:

> res[]
   id key1 key2 group_id1 group_id2 trait_id1 trait_id2
1:  1    1    1         1         1         1         2
2:  5    1    2         1         1         1         2
3:  2    1    2         1         2         1         1
4:  6    1    2         1         2         1         1
5:  3    1    2         1         2         1         2

We see that id 4 has been dropped: in the group with group_id1 = 1 and group_id2 = 1 (the group id 4 belongs to), only one key combination, (1,1), carries the traits (1,1), while the group contains two key combinations, (1,1) and (1,2). The traits (1,1) are therefore not shared by all keys in the group, so this trait combination is dropped from the group, and id 4 with it. Conversely, ids 1 and 5 carry the same traits under different keys, and together they cover all the keys in the group ((1,1) and (1,2)), so the rows for ids 1 and 5 are kept.
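To make the rule concrete, here is a minimal sketch of the two counts on the toy data above (working on a copy so dt is left untouched; n_keys and n_keys_trait are hypothetical helper columns):

dtc <- copy(dt)
# distinct (key1, key2) combinations present in each group
dtc[, n_keys := uniqueN(.SD), by = .(group_id1, group_id2),
    .SDcols = c("key1", "key2")]
# distinct (key1, key2) combinations carrying each trait combination in each group
dtc[, n_keys_trait := uniqueN(.SD),
    by = .(group_id1, group_id2, trait_id1, trait_id2),
    .SDcols = c("key1", "key2")]
dtc[id == 4, .(n_keys, n_keys_trait)]  # 2 vs 1, so id 4 must be dropped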

A function to achieve this is given here:

intersect_this_by2 <- function(dt,
                               key_by = NULL,
                               split_by = NULL,
                               intersect_by = NULL){

    dtc <- as.data.table(dt)       

    # compute number of keys in the group
    dtc[, n_keys := uniqueN(.SD), by = split_by, .SDcols = key_by]
    # compute number of keys represented by each trait in each group 
    # and keep row only if they represent all keys from the group
    dtc[, keep := n_keys == uniqueN(.SD), by = c(intersect_by, split_by), .SDcols = key_by]
    dtc <- dtc[keep == TRUE][, c("n_keys", "keep") := NULL]
    return(dtc)      
}

But it gets quite slow for big datasets or complex traits/keys/groups... the real data.table has 10 million rows and the traits have 30 levels. Is there any way to improve it? Any obvious pitfalls? Thanks for the help.

FINAL EDIT: Uwe proposed a concise solution which is 40% faster than my initial code (which I have deleted here because it was confusing). The final function looks like this:

intersect_this_by_uwe <- function(dt,
                                  key_by = c("key1"),
                                  split_by = c("group_id1", "group_id2"),
                                  intersect_by = c("trait_id1", "trait_id2")){
    dti <- copy(dt)
    dti[, original_order_id__ := 1:.N]
    setkeyv(dti, c(split_by, intersect_by, key_by))
    uni <- unique(dti, by = c(split_by, intersect_by, key_by))
    unique_keys_by_group <-
        unique(uni, by = c(split_by, key_by))[, .N, by = c(split_by)]
    unique_keys_by_group_and_trait <-
        uni[, .N, by = c(split_by, intersect_by)]
    # 1st join to pick group/traits combinations with equal number of unique keys
    selected_groups_and_traits <-
        unique_keys_by_group_and_trait[unique_keys_by_group,
                                       on = c(split_by, "N"), nomatch = 0L]
    # 2nd join to pick records of valid subsets
    dti[selected_groups_and_traits, on = c(split_by, intersect_by)][
        order(original_order_id__), -c("original_order_id__","N")]
}
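As a quick sanity check (a sketch; on the toy data both key columns are needed to reproduce the expected result):

res_uwe <- intersect_this_by_uwe(dt,
                                 key_by = c("key1", "key2"),
                                 split_by = c("group_id1", "group_id2"),
                                 intersect_by = c("trait_id1", "trait_id2"))
res_uwe[]  # should print the same five rows as res above, in the same order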

And for the record, the benchmarks on the 10 M rows dataset:

> microbenchmark::microbenchmark(old_way = {res <- intersect_this_by(dt,
+                                                                    key_by = c("key1"),
+                                                                    split_by = c("group_id1", "group_id2"),
+                                                                    intersect_by = c("trait_id1", "trait_id2"))},
+                                new_way = {res <- intersect_this_by2(dt,
+                                                                     key_by = c("key1"),
+                                                                     split_by = c("group_id1", "group_id2"),
+                                                                     intersect_by = c("trait_id1", "trait_id2"))},
+                                new_way_uwe = {res <- intersect_this_by_uwe(dt,
+                                                                            key_by = c("key1"),
+                                                                            split_by = c("group_id1", "group_id2"),
+                                                                            intersect_by = c("trait_id1", "trait_id2"))},
+                                times = 10)
Unit: seconds
        expr       min        lq      mean    median        uq       max neval cld
     old_way  3.145468  3.530898  3.514020  3.544661  3.577814  3.623707    10  b 
     new_way 15.670487 15.792249 15.948385 15.988003 16.097436 16.206044    10   c
 new_way_uwe  1.982503  2.350001  2.320591  2.394206  2.412751  2.436381    10 a  

EDIT

Although the answer below does reproduce the expected result for the small sample dataset, it fails to give the correct answer for the large, 10 M rows dataset provided by the OP.

However, I have decided to keep this wrong answer because the benchmark results show the poor performance of the uniqueN() function. In addition, the answer contains benchmarks of much faster, alternative solutions.



If I understand correctly, the OP wants to keep only those rows where the unique combinations of group_id1, group_id2, trait_id1, and trait_id2 occur for more than one distinct key1 value.

This can be achieved by counting the unique values of key1 in each group of group_id1, group_id2, trait_id1, and trait_id2 and selecting only those combinations where the count is larger than one. Finally, the matching rows are retrieved by joining:

library(data.table)
sel <- dt[, uniqueN(key1), by = .(group_id1, group_id2, trait_id1, trait_id2)][V1 > 1]
sel
   group_id1 group_id2 trait_id1 trait_id2 V1
1:         1         2         3         1  2
2:         2         2         2         1  2
3:         2         1         1         2  2
4:         1         1         1         1  2
5:         1         1         2         2  2
6:         2         2         2         2  2
7:         1         1         1         2  2
8:         1         1         3         2  2
res <- dt[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][order(id), -"V1"]
res
    id key1 group_id1 trait_id1 group_id2 trait_id2 extra
 1:  1    2         1         3         2         1     u
 2:  2    1         2         2         2         1     g
 3:  5    2         2         1         1         2     g
 4:  8    2         1         3         2         1     o
 5:  9    2         1         1         1         1     d
 6: 10    2         2         1         1         2     g
 7: 13    1         2         1         1         2     c
 8: 14    2         1         2         1         2     t
 9: 15    1         1         3         2         1     y
10: 16    2         1         3         2         1     v
11: 19    2         2         2         2         2     y
12: 22    2         2         2         2         1     g
13: 24    2         1         1         1         2     i
14: 25    1         1         3         1         2     n
15: 26    1         2         2         2         2     y
16: 27    1         1         1         1         1     n
17: 28    1         1         1         1         2     h
18: 29    1         2         2         2         2     b
19: 30    2         1         3         1         2     k
20: 31    1         2         2         2         2     w
21: 35    1         1         2         1         2     q
22: 37    2         2         1         1         2     r
23: 39    1         1         1         1         2     o
    id key1 group_id1 trait_id1 group_id2 trait_id2 extra

This reproduces OP's expected result, but is it also the fastest way, as requested?


Benchmarking Part 1

OP's code to create benchmark data (but with 1 M rows instead of 10 M rows) is used here:

set.seed(0)
n <- 1e6
p <- 1e5
m <- 5
dt <- data.table(id = 1:n,
                 key1 = sample(1:m, size = n, replace = TRUE),
                 group_id1 = sample(1:2, size = n, replace = TRUE),
                 trait_id1 = sample(1:p, size = n, replace = TRUE),
                 group_id2 = sample(1:2, size = n, replace = TRUE),
                 trait_id2 = sample(1:2, size = n, replace = TRUE),
                 extra = sample(letters, n, replace = TRUE))

I was quite surprised to find that the solution using uniqueN() is not the fastest one:

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
 old_way  489.4606  496.3801  523.3361  503.2997  540.2739  577.2482     3 a
 new_way 9356.4131 9444.5698 9567.4035 9532.7265 9672.8987 9813.0710     3   c
    uwe1 5946.4533 5996.7388 6016.8266 6047.0243 6052.0133 6057.0023     3  b

Benchmark code:

microbenchmark::microbenchmark(
  old_way = {
    DT <- copy(dt)
    res <- intersect_this_by(DT,
                             key_by = c("key1"),
                             split_by = c("group_id1", "group_id2"),
                             intersect_by = c("trait_id1", "trait_id2"))
  },
  new_way = {
    DT <- copy(dt)
    res <- intersect_this_by2(DT,
                              key_by = c("key1"),
                              split_by = c("group_id1", "group_id2"),
                              intersect_by = c("trait_id1", "trait_id2"))
  },
  uwe1 = {
    DT <- copy(dt)
    sel <- DT[, uniqueN(key1), by = .(group_id1, group_id2, trait_id1, trait_id2)][V1 > 1]
    res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id)]
  },
  times = 3L)

Note that for each run a fresh copy of the benchmark data is used in order to avoid any side effects from previous runs, e.g., indices set by data.table.

Switching verbose mode on

options(datatable.verbose = TRUE)

reveals that most of the time is spent computing uniqueN() for all the groups:

sel <- DT[, uniqueN(key1), by = .(group_id1, group_id2, trait_id1, trait_id2)][V1 > 1]
Detected that j uses these columns: key1
Finding groups using forderv ... 0.060sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
Getting back original order ... 0.050sec
lapply optimization is on, j unchanged as 'uniqueN(key1)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
  collecting discontiguous groups took 0.084s for 570942 groups
  eval(j) took 5.505s for 570942 calls
5.940sec

This is a known issue. However, the alternative length(unique()) (for which uniqueN() is an abbreviation) brings only a moderate speed-up of about 2x.
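For reference, the drop-in replacement is simply (a sketch; same grouping, only the aggregate changes):

sel <- DT[, length(unique(key1)),
          by = .(group_id1, group_id2, trait_id1, trait_id2)][V1 > 1]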

So I started to look for ways to avoid uniqueN() and length(unique()).


Benchmarking Part 2

I have found two alternatives which are sufficiently fast. Both first create a data.table of the unique combinations of group_id1, group_id2, trait_id1, trait_id2, and key1, then count the number of distinct key1 values for each group of group_id1, group_id2, trait_id1, trait_id2, and filter for counts greater than one:

sel <- DT[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2, key1)][
  , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]

and

sel <- unique(DT, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))[
  , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]
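In both variants the matching rows are then retrieved with the same join as before (repeated here from the benchmark code below):

res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][order(id)]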

The verbose output shows that computing times for these variants are significantly better.

For benchmarking, only the fastest methods are used, but now with 10 M rows. In addition, each variant is tried with setkey() and setorder(), respectively, applied beforehand:

microbenchmark::microbenchmark(
  old_way = {
    DT <- copy(dt)
    res <- intersect_this_by(DT,
                             key_by = c("key1"),
                             split_by = c("group_id1", "group_id2"),
                             intersect_by = c("trait_id1", "trait_id2"))
  },
  uwe3 = {
    DT <- copy(dt)
    sel <- DT[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2, key1)][
      , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]
    res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id)]
  },
  uwe3k = {
    DT <- copy(dt)
    setkey(DT, group_id1, group_id2, trait_id1, trait_id2, key1)
    sel <- DT[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2, key1)][
      , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]
    res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id)]
  },
  uwe3o = {
    DT <- copy(dt)
    setorder(DT, group_id1, group_id2, trait_id1, trait_id2, key1)
    sel <- DT[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2, key1)][
      , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]
    res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id)]
  },
  uwe4 = {
    DT <- copy(dt)
    sel <- unique(DT, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))[
      , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]
    res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id)]
  },
  uwe4k = {
    DT <- copy(dt)
    setkey(DT, group_id1, group_id2, trait_id1, trait_id2, key1)
    sel <- unique(DT, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))[
      , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]
    res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id)]
  },
  uwe4o = {
    DT <- copy(dt)
    setorder(DT, group_id1, group_id2, trait_id1, trait_id2, key1)
    sel <- unique(DT, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))[
      , .N, by = .(group_id1, group_id2, trait_id1, trait_id2)][N > 1]
    res <- DT[sel, on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id)]
  },
  times = 3L)

The benchmark results for the 10 M case show that both variants are faster than OP's intersect_this_by() function and that keying or ordering beforehand pushes the speed-up further (with a minimal advantage for ordering).

Unit: seconds
    expr      min       lq     mean   median       uq      max neval cld
 old_way 7.173517 7.198064 7.256211 7.222612 7.297559 7.372506     3   d
    uwe3 6.820324 6.833151 6.878777 6.845978 6.908003 6.970029     3  c
   uwe3k 5.349949 5.412018 5.436806 5.474086 5.480234 5.486381     3 a
   uwe3o 5.423440 5.432562 5.467376 5.441683 5.489344 5.537006     3 a
    uwe4 6.270724 6.276757 6.301774 6.282790 6.317299 6.351807     3  b
   uwe4k 5.280763 5.295251 5.418803 5.309739 5.487823 5.665906     3 a
   uwe4o 4.921627 5.095762 5.157010 5.269898 5.274702 5.279506     3 a

With the additional explanations by the OP, I believe I have gained a better understanding of the problem.

The OP wants to remove incomplete subsets from his dataset. Each group_id1, group_id2 group contains a set of unique key1 values. A complete subset contains at least one group_id1, group_id2, trait_id1, trait_id2, key1 record for each of the key1 values in the group_id1, group_id2 group.

It is not necessary to check the key1 values themselves when comparing the grouping on the group_id1, group_id2, trait_id1, trait_id2 level with the group_id1, group_id2 level: since the keys occurring for a trait combination are a subset of the keys in its group, it is sufficient to check whether the numbers of distinct key1 values are equal.

So, the solution below follows the general outline of OP's own answer but uses two joins to achieve the result:

setkey(dt, group_id1, group_id2, trait_id1, trait_id2, key1)
uni <- unique(dt, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))
unique_keys_by_group <- 
  unique(uni, by = c("group_id1", "group_id2", "key1"))[, .N, by = .(group_id1, group_id2)]
unique_keys_by_group_and_trait <- 
  uni[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2)]
# 1st join to pick group/traits combinations with equal number of unique keys
selected_groups_and_traits <- 
  unique_keys_by_group_and_trait[unique_keys_by_group, 
                                 on = .(group_id1, group_id2, N), nomatch = 0L]
# 2nd join to pick records of valid subsets
res <- dt[selected_groups_and_traits, on = .(group_id1, group_id2, trait_id1, trait_id2)][
  order(id), -"N"]

It can be verified that the result is identical to OP's result:

identical(
  intersect_this_by(dt,
                    key_by = c("key1"),
                    split_by = c("group_id1", "group_id2"),
                    intersect_by = c("trait_id1", "trait_id2")),
  res)
 [1] TRUE 

Note that the uniqueN() function is not used due to performance issues, as shown in the benchmarks of my first (wrong) answer.

Benchmark comparison

OP's benchmark data is used (10 M rows).

library(microbenchmark)
mb <- microbenchmark(
  old_way = {
    DT <- copy(dt)
    res <- intersect_this_by(DT,
                             key_by = c("key1"),
                             split_by = c("group_id1", "group_id2"),
                             intersect_by = c("trait_id1", "trait_id2"))
  },
  uwe = {
    DT <- copy(dt)
    setkey(DT, group_id1, group_id2, trait_id1, trait_id2, key1)
    uni <- 
      unique(DT, by = c("group_id1", "group_id2", "trait_id1", "trait_id2", "key1"))
    unique_keys_by_group <- 
      unique(uni, by = c("group_id1", "group_id2", "key1"))[
        , .N, by = .(group_id1, group_id2)]
    unique_keys_by_group_and_trait <- 
      uni[, .N, by = .(group_id1, group_id2, trait_id1, trait_id2)]
    selected_groups_and_traits <- 
      unique_keys_by_group_and_trait[unique_keys_by_group, 
                                     on = .(group_id1, group_id2, N), nomatch = 0L]
    res <- DT[selected_groups_and_traits, 
              on = .(group_id1, group_id2, trait_id1, trait_id2)][
      order(id), -"N"]
  },
  times = 3L)
mb

The solution presented here is 40% faster:

Unit: seconds
    expr      min       lq     mean   median       uq      max neval cld
 old_way 7.251277 7.315796 7.350636 7.380316 7.400315 7.420315     3  b
     uwe 4.379781 4.461368 4.546267 4.542955 4.629510 4.716065     3 a

Edit: Further performance improvements

The OP has asked for ideas to further improve performance.

I have already tried different approaches, including a double nested grouping (using the slow uniqueN() just to keep the code simple):

res <- DT[, {
  nuk_g = uniqueN(key1) 
  .SD[, if(nuk_g == uniqueN(key1)) .SD, by = .(trait_id1, trait_id2)]
}, by = .(group_id1, group_id2)][order(id)]

but they were all slower for the given benchmark data.

It is likely that the performance of a particular method does not depend solely on the problem size, i.e., the number of rows, but also on the problem structure, e.g., the number of different groups, traits, and keys, as well as on data types, etc.

So, without knowing the structure of your production data and the context of your computational flow, I do not think it is worthwhile to spend more time on benchmarking.

Anyway, there is one suggestion: make sure that setkey() is called only once, as it is rather costly (about 2 seconds) but speeds up all subsequent operations. (Verify with options(datatable.verbose = TRUE).)
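A minimal sketch of that pattern (verbose mode makes data.table report what each step costs):

options(datatable.verbose = TRUE)   # print sorting/grouping details per call
setkey(DT, group_id1, group_id2, trait_id1, trait_id2, key1)  # pay the sort once
# subsequent unique(), grouping, and join operations on these columns reuse the key
options(datatable.verbose = FALSE)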

I'll start with a tidyverse approach and show the equivalent in data.table.

Let me know if this result isn't what's intended, because it does differ from your required output - but it's what you've described in the text.

1. Tidy approach

Just create a single column from the traits, then group by the grouping columns and the new combined traits, and filter for group frequency greater than 1.

library(dplyr)

dt %>%
  mutate(comb = paste0(trait_id1, trait_id2)) %>%
  group_by(group_id1, group_id2, comb) %>%
  filter(n() > 1)

2. data.table approach

Much the same methodology as the prior tidy approach, just written in data.table.

Using the answer from here to find fast paste methods.

dt[, comb := do.call(paste, c(.SD, sep = "")), .SDcols = c("trait_id1", "trait_id2")][
  , freq := .N, by = .(group_id1, group_id2, comb)][freq > 1]

Comparison

Comparing the two methods and Chinsoon's comment, the speeds are:

microbenchmark::microbenchmark(zac_tidy = {
  dt %>%
    mutate(comb = paste0(trait_id1, trait_id2)) %>%
    group_by(group_id1, group_id2, comb) %>%
    filter(n() > 1)
},
zac_dt = {
  dt[, comb := do.call(paste, c(.SD, sep = "")), .SDcols = c("trait_id1", "trait_id2")][, freq := .N, by = .(group_id1, group_id2, comb)][freq > 1]
},
chin_dt = {
  dt[id %in% dt[, .SD[, if (.N > 1) id, by=.(trait_id1, trait_id2)], by=.(group_id1, group_id2)]$V1]
}, times = 100)

Unit: milliseconds
     expr      min       lq     mean   median       uq       max neval
 zac_tidy 4.151115 4.677328 6.150869 5.552710 7.765968  8.886388   100
   zac_dt 1.965013 2.201499 2.829999 2.640686 3.507516  3.831240   100
  chin_dt 4.567210 5.217439 6.972013 7.330628 8.233379 12.807005   100

> identical(zac_dt, chin_dt)
[1] TRUE

Comparison at 10 million

10 repeats:

Unit: milliseconds
     expr       min        lq      mean    median       uq       max neval
 zac_tidy 12.492261 14.169898 15.658218 14.680287 16.31024 22.062874    10
   zac_dt 10.169312 10.967292 12.425121 11.402416 12.23311 21.036535    10
  chin_dt  6.381693  6.793939  8.449424  8.033886  9.78187 12.005604    10
 chin_dt2  5.536246  6.888020  7.914103  8.310142  8.74655  9.600121    10

I'd therefore recommend Chinsoon's method. Either works.

The other answer does not solve the problem, but I have found a method inspired by it. First compute the number of keys present in the group, then for each trait combination keep only the rows that represent the full number of keys:

intersect_this_by2 <- function(dt,
                               key_by = NULL,
                               split_by = NULL,
                               intersect_by = NULL){

    if (is.null(intersect_by) ||
        is.null(key_by) ||
        !is.data.frame(dt) ||
        nrow(dt) == 0) {
        return(dt)
    }
    data_table_input <- is.data.table(dt)
    dtc <- as.data.table(dt)

    if (!is.null(split_by)) {
        # compute number of keys in the group
        dtc[, n_keys := uniqueN(.SD), by = split_by, .SDcols = key_by]
        # compute number of keys represented by each trait in each group 
        # and keep row only if they represent all keys from the group
        dtc[, keep := n_keys == uniqueN(.SD), by = c(intersect_by, split_by), .SDcols = key_by]
        dtc <- dtc[keep == TRUE][, c("n_keys", "keep") := NULL]
    } else {
        dtc[, n_keys := uniqueN(.SD), .SDcols = key_by]
        dtc[, keep := n_keys == uniqueN(.SD), by = c(intersect_by), .SDcols = key_by]
        dtc <- dtc[keep == TRUE][, c("n_keys", "keep") := NULL]
    }
    if (!data_table_input) {
        return(as.data.frame(dtc))
    } else {
        return(dtc)
    }
}

The problem is that it is much slower on my real dataset (5-6 times slower), but I think this function helps to understand the problem better. A dataset closer to my real one is defined below:

pacman::p_load(data.table, microbenchmark, testthat)

set.seed(0)
n <- 1e7
p <- 1e5
m <- 5
dt <- data.table(id = 1:n,
                 key1 = sample(1:m, size = n, replace = TRUE),
                 group_id1 = sample(1:2, size = n, replace = TRUE),
                 trait_id1 = sample(1:p, size = n, replace = TRUE),
                 group_id2 = sample(1:2, size = n, replace = TRUE),
                 trait_id2 = sample(1:2, size = n, replace = TRUE),
                 extra = sample(letters, n, replace = TRUE))
microbenchmark::microbenchmark(old_way = {res <- intersect_this_by(dt,
                                                                    key_by = c("key1"),
                                                                    split_by = c("group_id1", "group_id2"),
                                                                    intersect_by = c("trait_id1", "trait_id2"))},
                               new_way = {res <- intersect_this_by2(dt,
                                                                   key_by = c("key1"),
                                                                   split_by = c("group_id1", "group_id2"),
                                                                   intersect_by = c("trait_id1", "trait_id2"))},
                               times = 1)


Unit: seconds
    expr       min        lq      mean    median        uq       max neval
 old_way  5.891489  5.891489  5.891489  5.891489  5.891489  5.891489     1
 new_way 18.455860 18.455860 18.455860 18.455860 18.455860 18.455860     1

For information, the number of rows of res in this example is:

> set.seed(0)
> n <- 1e7
> p <- 1e5
> m <- 5
> dt <- data.table(id = 1:n,
                   key1 = sample(1:m, size = n, replace = TRUE),
                   group_id1 = sample(1:2, size = n, replace = TRUE),
                   trait_id1 = sample(1:p, size = n, replace = TRUE),
                   group_id2 = sample(1:2, size = n, replace = TRUE),
                   trait_id2 = sample(1:2, size = n, replace = TRUE),
                   extra = sample(letters, n, replace = TRUE))
> res <- intersect_this_by(dt,
                            key_by = c("key1"),
                            split_by = c("group_id1", "group_id2"),
                            intersect_by = c("trait_id1", "trait_id2"))
> nrow(res)
[1] 7099860
> res <- intersect_this_by2(dt,
                            key_by = c("key1"),
                            split_by = c("group_id1", "group_id2"),
                            intersect_by = c("trait_id1", "trait_id2"))
> nrow(res)
[1] 7099860
