Keeping vectors (from list of vectors) whose elements do not have a proper subset within that same list (using Rcpp)

I have asked this question previously (see here) and received a satisfactory answer using the purrr package. However, this has proved to be a bottleneck in my program, so I would like to rewrite that section using the Rcpp package.

Proper subset: a proper subset S' of a set S is a subset that is strictly contained in S and so excludes S itself (note that I am also excluding the empty set).
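
The definition above can be written as a small base R predicate. This is only a sketch: the name `is_proper_subset` is made up for illustration, and it assumes each vector is treated as a set, so repeated values inside a vector are ignored.

```r
# Hypothetical helper: TRUE when `s1` is a non-empty proper subset of `s2`.
is_proper_subset <- function(s1, s2) {
  s1 <- unique(s1)
  s2 <- unique(s2)
  length(s1) > 0 &&            # the empty set is excluded by the definition
    all(s1 %in% s2) &&         # every element of s1 appears in s2
    length(s1) < length(s2)    # and s2 has at least one extra element
}

is_proper_subset(c(1, 2), c(1, 2, 3))  # TRUE
is_proper_subset(c(1, 2), c(1, 2))     # FALSE: the sets are equal
is_proper_subset(c(1, 5), c(1, 2, 3))  # FALSE: 5 is missing from the second set
```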

Suppose you have the following vectors in a list:

a = c(1,2)
b = c(1,3)
c = c(2,4)
d = c(1,2,3,4)
e = c(2,4,5)
f = c(1,2,3)

My aim is to keep only the vectors that have no proper subset within the list, which in this example would be a, b and c.

Previous Solution

library(purrr)

possibilities <- list(a,b,c,d,e,f)
keep(possibilities,
     map2_lgl(.x = possibilities,
              .y = seq_along(possibilities),
              ~ !any(map_lgl(possibilities[-.y], function(z) all(z %in% .x)))))

The idea here is to avoid the O(N^3) approach and use a lower order instead. The other answer provided here will still be slow, since it is greater than O(N^2). Here is a solution that takes less than O(N^2) in typical cases, with a worst case of O(N^2) when all the elements are unique.

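onlySet <- function(x) {
  # Unprocessed elements sit at the front of x; counting them guarantees
  # that the loop stops even when earlier survivors are removed later on.
  todo <- length(x)
  while (todo > 0) {
    first <- x[[1]]
    # TRUE for every other element that is NOT a superset of the first one
    y <- vapply(x[-1], function(el) !all(first %in% el), logical(1))
    todo <- todo - 1 - sum(!y[seq_len(todo - 1)])
    x <- c(x[-1][y], x[1])  # drop the supersets, move the first element to the back
  }
  x
}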
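
To make the complexity claim more concrete, here is a rough sketch that counts subset checks for the pruning idea (take one vector, drop every superset of it, move on). The function `count_checks` and its bookkeeping are assumptions made for illustration, not code from the answer; it compares a list full of duplicates and supersets with a list of 600 distinct singletons, where nothing can be eliminated.

```r
# Hypothetical instrumented variant of the pruning idea: it also counts
# how many subset checks are performed.
count_checks <- function(x) {
  checks <- 0
  todo <- length(x)                    # unprocessed elements sit at the front
  while (todo > 0) {
    first <- x[[1]]
    keep <- vapply(x[-1], function(el) !all(first %in% el), logical(1))
    checks <- checks + length(keep)
    todo <- todo - 1 - sum(!keep[seq_len(todo - 1)])
    x <- c(x[-1][keep], x[1])          # drop supersets, move the head to the back
  }
  list(kept = length(x), checks = checks)
}

possibilities <- list(c(1, 2), c(1, 3), c(2, 4),
                      c(1, 2, 3, 4), c(2, 4, 5), c(1, 2, 3))
repeated <- count_checks(rep(possibilities, 100))  # many duplicates and supersets
distinct <- count_checks(as.list(1:600))           # nothing can be eliminated

repeated$checks
distinct$checks  # 600 * 599: every element is compared with every other one
```

When most vectors are eliminated early, the list shrinks quickly and far fewer checks are needed than in the all-distinct case.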

Now, to show the time difference, check out the following:

match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a){
 mat <- outer(a, a, match_fun)
 a[colSums(mat) == 1]
}

poss <- rep(possibilities, 100)

microbenchmark::microbenchmark(method1(poss), onlySet(poss))

Unit: milliseconds
          expr      min        lq       mean    median        uq       max neval cld
 method1(poss) 840.7919 880.12635 932.255030 889.36380 923.32555 1420.1077   100   b
 onlySet(poss)   1.9845   2.07005   2.191647   2.15945   2.24245    3.3656   100  a 
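
One thing worth checking when reading this benchmark: `poss` holds 100 copies of each vector, and the matrix-based test counts an identical copy as a subset. The sketch below reuses `method1` as defined above and assumes that deduplicating with `unique()` is acceptable for the use case; it suggests that `method1` keeps nothing on `poss` itself, so the two functions in the benchmark do not return the same result on this input.

```r
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
method1 <- function(a) {
  mat <- outer(a, a, match_fun)
  a[colSums(mat) == 1]
}

possibilities <- list(c(1, 2), c(1, 3), c(2, 4),
                      c(1, 2, 3, 4), c(2, 4, 5), c(1, 2, 3))
poss <- rep(possibilities, 100)

length(method1(poss))          # 0: every vector has identical copies in the list
length(method1(unique(poss)))  # 3: c(1, 2), c(1, 3) and c(2, 4)
```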

Have you tried optimising the solution in base R first? For example, the following reproduces your expected output and uses (faster) base R array routines:

match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(possibilities, possibilities, match_fun)
possibilities[colSums(mat) == 1]
#[[1]]
#[1] 1 2
#
#[[2]]
#[1] 1 3
#
#[[3]]
#[1] 2 4
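
To see why `colSums(mat) == 1` selects the right vectors, it can help to look at the matrix with labels. The sketch below assumes a named copy of the list (called `sets` here purely for illustration): entry `[i, j]` is `TRUE` when vector `i` is a subset of vector `j`, so a column sum of 1 means the only subset found is the vector itself.

```r
sets <- list(a = c(1, 2), b = c(1, 3), c = c(2, 4),
             d = c(1, 2, 3, 4), e = c(2, 4, 5), f = c(1, 2, 3))

# mat[i, j] is TRUE when sets[[i]] is a subset of sets[[j]]
match_fun <- Vectorize(function(s1, s2) all(s1 %in% s2))
mat <- outer(sets, sets, match_fun)
dimnames(mat) <- list(subset = names(sets), of = names(sets))

mat
colSums(mat)                    # how many list entries each vector contains
names(sets)[colSums(mat) == 1]  # "a" "b" "c"
```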

Inspired by Onyambu's performant solution, here is another base R option using a recursive function:

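f_recursive <- function(x, i = 1) {
    if (i > length(x)) return(x)
    # Positions in x of the other elements that are supersets of x[[i]]
    idx <- setdiff(which(vapply(x, function(el) all(x[[i]] %in% el), logical(1))), i)
    if (length(idx) == 0) return(f_recursive(x, i + 1))
    # Removing elements ahead of position i shifts the current element left
    f_recursive(x[-idx], i - sum(idx < i) + 1)
}
f_recursive(possibilities)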

The performance is on par with Onyambu's solution.

poss <- rep(possibilities, 100)
microbenchmark::microbenchmark(
    method1(poss),
    onlySet(poss),
    f_recursive(poss))
#Unit: milliseconds
#              expr        min         lq       mean     median         uq
#     method1(poss) 682.558602 710.974831 750.325377 730.627996 765.040976
#     onlySet(poss)   1.700646   1.782713   1.870972   1.819820   1.918669
# f_recursive(poss)   1.681120   1.737459   1.884685   1.806384   1.901582
#         max neval
# 1200.562889   100
#    2.371646   100
#    3.217013   100
