How to return the unique elements between vectors while keeping track of the source vector of each unique element?

Question

For example, I have 5 vectors in a list:

A <- c(1,2,3,4,5)

B <- c(1,2,3,4,5,6)

C <- c(5,6,7,8,9)

D <- c(8,9)

In reality I have 100s of these vectors, but I only gave 5 vectors for reproducibility. My goal is to:

Identify the unique elements coming from the vectors. For example, vector A shouldn't return anything because all of its elements are part of vector B, whereas vector B does contribute one extra unique element, 6. Vector C should give me 7,8,9 since c(5,6) were already included in vector B. Vector D should return nothing because all of its elements are part of C.
Recognize which vector each unique element came from.
Find which vectors are subsets of other, bigger vectors. For example, vector D is a subset of C and vector A is a subset of vector B.

So far the only solution I've found was:

Reduce(setdiff, list("my_vectors"))

But it doesn't allow me to recognize which element is unique from which vector. For example, Reduce(setdiff, list(A,B)) would return 6, but I would have no idea where the 6 came from (A or B).

My difficulty is that this is a large-scale problem: I don't have only 5 vectors, I have 100s of them, so I can't figure out a sustainable solution. Any tips are appreciated.

Edit: my vectors are in a list.

Answer 1

A first naive approach would be a for-loop, just to have a working solution. The function returns a list with the unique elements and a data frame describing which vector in vectorList each unique element (first appearance) comes from.

A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5,6)
C <- c(5,6,7,8,9)
D <- c(8,9)

vectorList <- list(A,B,C,D)

ff <- function(vectorList) {
  uniques <- unique(vectorList[[1]])
  comingFromDf <- data.frame(values=uniques)
  comingFromDf$source <- 1
  
  for(k in 2:length(vectorList)) {
    vec <- vectorList[[k]]
    newUniques <- vec[!(vec %in% uniques)]
    if(length(newUniques)) {
      newUniques <- unique(newUniques)
      toAdd <- data.frame(values=newUniques)
      toAdd$source <- k
      comingFromDf <- rbind(comingFromDf,toAdd)
      uniques <- c(uniques,newUniques)
    }
  }
  
  list(uniqueElements = uniques,
       comingFromInfo = comingFromDf)
}

ff(vectorList)

I don't know how performant you need the function to be, but even with 200 vectors of length 1000 it seems to be quite fast (I don't know about your dimensions):

bigVectorList <- lapply(1:200, function(k) {
  sample(1:1e6,1000)
})

microbenchmark::microbenchmark(ff(bigVectorList),times=10)
#Unit: milliseconds
#              expr      min       lq     mean   median      uq      max neval
#ff(bigVectorList) 619.5148 624.8351 639.7535 633.2326 647.118 685.0387    10

On my machine it took a bit more than half a second; maybe that's enough for you. Since the function only involves vectors and a data frame, it would be quite easy to re-implement it in C++ using Rcpp. This should be much faster than the for-loop implementation in R. Moreover, you can consider using the accumulate argument of the Reduce function to save the intermediate results.
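Before reaching for Rcpp, it may be worth trying a vectorized version in base R. The following is a minimal sketch, not part of the original answer: it stacks all vectors into one, records the index of the vector each value came from, and keeps only first appearances with duplicated(). The names first_appearances, values, and source_index are chosen here purely for illustration.

```r
# Hypothetical vectorized alternative to ff(); the function name is illustrative.
first_appearances <- function(vectorList) {
  # Flatten all vectors into one, in list order
  values <- unlist(vectorList, use.names = FALSE)
  # For every flattened value, the index of the vector it came from
  source_index <- rep(seq_along(vectorList), lengths(vectorList))
  # Keep only the first time each value is seen
  first <- !duplicated(values)
  data.frame(values = values[first], source = source_index[first])
}

vectorList <- list(c(1,2,3,4,5), c(1,2,3,4,5,6), c(5,6,7,8,9), c(8,9))
res <- first_appearances(vectorList)
res
```

For the four example vectors this assigns 1 to 5 to source 1, 6 to source 2, and 7, 8, 9 to source 3, which matches comingFromInfo from ff(). Whether it is actually faster on your data is something to benchmark rather than assume.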

Answer 2

Assume your data is stored like this:

my_vectors <- list(
  A = c(1,2,3,4,5),
  B = c(1,2,3,4,5,6),
  C = c(5,6,7,8,9),
  D = c(8,9)
)

If you pass accumulate = TRUE to the call of Reduce, you get every intermediate result as well. We can use this together with union to build up the total set step by step (note that I set init = c() to make sure we start empty):

acc <- Reduce(union, my_vectors, init = c(), accumulate = T)

Then we can take the setdiff of each vector against the set accumulated before it.

lapply(1:length(my_vectors), function(i) setdiff(my_vectors[[i]], acc[[i]]))

This gives:

[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 6

[[3]]
[1] 7 8 9

[[4]]
numeric(0)

You can apply the names of my_vectors later if you want.
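Carrying the names over could look like the sketch below. It repeats the two steps above so that it runs on its own, and then uses setNames() to label each result with the vector it came from. The variable name new_elements is an assumption made for this example, and init = numeric(0) is used here as an explicitly typed empty start in place of c().

```r
my_vectors <- list(
  A = c(1,2,3,4,5),
  B = c(1,2,3,4,5,6),
  C = c(5,6,7,8,9),
  D = c(8,9)
)

# acc[[i]] holds everything seen *before* vector i (acc[[1]] is empty)
acc <- Reduce(union, my_vectors, init = numeric(0), accumulate = TRUE)

# Elements of each vector that were not seen in any earlier vector
new_elements <- lapply(seq_along(my_vectors), function(i) {
  setdiff(my_vectors[[i]], acc[[i]])
})

# Label each result with the name of its source vector
new_elements <- setNames(new_elements, names(my_vectors))
new_elements
```

With names attached, new_elements$B is 6 and new_elements$D is empty, so each new element can be traced back to its source vector.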

Answer 3

Here is a tidyverse solution.

lag(accumulate(l, union)) keeps track of all the elements seen so far. The difference between each vector in the original list and this running set yields the newly seen elements.

library(tidyverse)

l <- lst(A, B, C, D)

map2(l, lag(accumulate(l, union)), setdiff)
#> $A
#> [1] 1 2 3 4 5
#> 
#> $B
#> [1] 6
#> 
#> $C
#> [1] 7 8 9
#> 
#> $D
#> numeric(0)

Here is an answer to your other question about finding which vectors are subsets of other, bigger vectors.

expand_grid will get all combinations of the vectors. Filter this to find which vector is a subset of any other vector.

l %>% 
  enframe() %>% 
  expand_grid(a = ., b = .) %>% 
  filter(
    a$name != b$name,
    map2_lgl(a$value, b$value, ~all(.x %in% .y))
  ) %>% 
  transmute(this_vector = a$name, is_a_subset_of_this_vector = b$name)
#> # A tibble: 2 x 2
#>   this_vector is_a_subset_of_this_vector
#>   <chr>       <chr>
#> 1 A           B
#> 2 D           C
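If the tidyverse is not available, the same pairwise subset check can be sketched in base R. This is an illustrative alternative rather than the answerer's code: expand.grid() enumerates all ordered pairs of vector names, and mapply() tests each pair with all(x %in% y). The names vecs, pairs, and subsets are assumptions made for this example.

```r
vecs <- list(
  A = c(1,2,3,4,5),
  B = c(1,2,3,4,5,6),
  C = c(5,6,7,8,9),
  D = c(8,9)
)

# All ordered pairs of vector names, excluding a vector paired with itself
n <- names(vecs)
pairs <- expand.grid(this_vector = n, is_a_subset_of = n,
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$this_vector != pairs$is_a_subset_of, ]

# TRUE where every element of the first vector appears in the second
keep <- mapply(function(a, b) all(vecs[[a]] %in% vecs[[b]]),
               pairs$this_vector, pairs$is_a_subset_of)

subsets <- pairs[keep, ]
rownames(subsets) <- NULL
subsets
```

Like the expand_grid version, this compares every pair, so the number of checks grows quadratically with the number of vectors. With a few hundred vectors that is tens of thousands of comparisons, which is usually still manageable.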

Answer 4

Here you only have one truly unique element, which is 7 in C. The code below returns the unique elements as well as their memberships:

mylist <- list("A"=A,"B"=B,"C"=C,"D"=D) #better for 100's of vectors
myres <- !unlist(lapply(1:length(mylist), function(x) unlist(mylist[x]) %in% unlist(mylist[-x])))
result <- as.numeric(unlist(mylist)[myres])
member <- sapply(mylist, function(x) result %in% x)
membername <- names(mylist[member])
result
membername
> result
 7 
> membername
[1] "C"
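When more than one element is truly unique, it can help to return value and source together as a table. The following is a hedged sketch of that idea, not the answerer's code: it deduplicates within each vector, flattens everything, and keeps the values that occur in exactly one vector. The names per_vector, values, source_name, and res are chosen here for illustration.

```r
mylist <- list(
  A = c(1,2,3,4,5),
  B = c(1,2,3,4,5,6),
  C = c(5,6,7,8,9),
  D = c(8,9)
)

# Deduplicate within each vector so repeats inside one vector do not count
per_vector <- lapply(mylist, unique)
values <- unlist(per_vector, use.names = FALSE)
source_name <- rep(names(per_vector), lengths(per_vector))

# A value is "truly unique" if it never appears a second time across vectors
once <- !(values %in% values[duplicated(values)])

res <- data.frame(value = values[once], source = source_name[once],
                  stringsAsFactors = FALSE)
res
```

For the example data this leaves a single row: value 7 with source "C".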

4 answers

Answer 1
1 2020-11-13 12:16:56

Answer 2
1 2020-11-13 12:22:37

Answer 3
1 2020-11-13 12:26:19

Answer 4
0 2020-11-13 13:01:19
