
How to return the unique elements between vectors while retaining the source vector of these unique elements?

For example, I have four vectors in a list:

A <- c(1,2,3,4,5)

B <- c(1,2,3,4,5,6)

C <- c(5,6,7,8,9)

D <- c(8,9)

In reality I have hundreds of these vectors; I only show four here for reproducibility. My goal is to:

  1. Identify the unique elements contributed by each vector. For example, vector A shouldn't return anything because all of its elements are part of vector B, whereas vector B contributes one extra unique element, 6. Vector C should give me 7, 8, 9, since 5 and 6 are already included in vector B. Vector D should return nothing because all of its elements are part of C.
  2. Recognize which unique element comes from which vector.
  3. Find which vectors are subsets of other, bigger vectors. For example, vector D is a subset of C, and vector A is a subset of vector B.

So far the only solution I've found is:

Reduce(setdiff, list("my_vectors"))

But it doesn't allow me to recognize which element is unique to which vector. For example, Reduce(setdiff, list(B, A)) returns 6, but the result alone gives me no idea where the 6 came from (A or B).

My difficulty is that this is a large-scale problem: I don't have just four vectors, I have hundreds of them, so I can't figure out a sustainable solution. Any tips are appreciated.

Edit: my vectors are in a list.

A first naive approach would be a for-loop, just to have a working solution. The function returns a list containing the unique elements and a data frame describing which vector in vectorList each unique element came from (its first appearance).

A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5,6)
C <- c(5,6,7,8,9)
D <- c(8,9)

vectorList <- list(A,B,C,D)

ff <- function(vectorList) {
  uniques <- unique(vectorList[[1]])
  comingFromDf <- data.frame(values=uniques)
  comingFromDf$source <- 1
  
  for (k in seq_along(vectorList)[-1]) {  # skip vector 1; also safe for a one-element list
    vec <- vectorList[[k]]
    newUniques <- vec[!(vec %in% uniques)]
    if(length(newUniques)) {
      newUniques <- unique(newUniques)
      toAdd <- data.frame(values=newUniques)
      toAdd$source <- k
      comingFromDf <- rbind(comingFromDf,toAdd)
      uniques <- c(uniques,newUniques)
    }
  }
  
  list(uniqueElements = uniques,
       comingFromInfo = comingFromDf)
}

ff(vectorList)

I don't know how performant you need the function to be, but even with 200 vectors of length 1000 it seems to be quite fast (I don't know your dimensions):

bigVectorList <- lapply(1:200, function(k) {
  sample(1:1e6,1000)
})

microbenchmark::microbenchmark(ff(bigVectorList),times=10)
#Unit: milliseconds
#              expr      min       lq     mean   median      uq      max neval
#ff(bigVectorList) 619.5148 624.8351 639.7535 633.2326 647.118 685.0387    10

On my machine it took a bit more than half a second; maybe that's enough for you. Since the function only involves vectors and a data frame, it would be quite easy to re-implement it in C++ using Rcpp, which should be much faster than the for-loop implementation in R. Moreover, you can consider using the accumulate argument of the Reduce function to save the intermediate results.
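Before reaching for Rcpp, a loop-free variant in base R may be worth trying. The following is only a sketch and not the function benchmarked above: it flattens the list into a long table of value/source pairs and keeps the first occurrence of each value with duplicated(). The names long and first_seen are illustrative, and the list is given names here (an assumption; the original vectorList is unnamed) so the result can be labelled.

```r
vectorList <- list(A = c(1, 2, 3, 4, 5),
                   B = c(1, 2, 3, 4, 5, 6),
                   C = c(5, 6, 7, 8, 9),
                   D = c(8, 9))

# One row per element, tagged with the position of the vector it came from.
long <- data.frame(
  values = unlist(vectorList, use.names = FALSE),
  source = rep(seq_along(vectorList), lengths(vectorList))
)

# duplicated() flags every occurrence after the first, so negating it keeps
# each value once, attributed to the earliest vector that contains it.
first_seen <- long[!duplicated(long$values), ]

# New elements contributed by each vector; vectors missing from this list
# (here D) contributed nothing new.
split(first_seen$values, names(vectorList)[first_seen$source])
```

Like ff, this attributes a value to the first vector in list order that contains it, so the result depends on how the list is ordered.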

Assume your data is stored like this:

my_vectors <- list(
  A = c(1,2,3,4,5),
  B = c(1,2,3,4,5,6),
  C = c(5,6,7,8,9),
  D = c(8,9)
)

If you pass accumulate = TRUE to Reduce, you get every intermediate result as well. We can use this together with union to build up the total set step by step (note that I set init = c() to make sure we start empty):

acc <- Reduce(union, my_vectors, init = c(), accumulate = TRUE)

Then we can take the setdiff of every vector against the set accumulated before it.

lapply(1:length(my_vectors), function(i) setdiff(my_vectors[[i]], acc[[i]]))

This gives:

[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 6

[[3]]
[1] 7 8 9

[[4]]
numeric(0)

You can apply the names of my_vectors afterwards if you want.
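Putting the names back could look like the sketch below, which repeats the two steps above so that it runs on its own. The variable names new_elements and contributes_nothing are illustrative and not from the answer.

```r
my_vectors <- list(
  A = c(1, 2, 3, 4, 5),
  B = c(1, 2, 3, 4, 5, 6),
  C = c(5, 6, 7, 8, 9),
  D = c(8, 9)
)

# acc[[i]] holds everything seen before vector i (acc[[1]] is the empty init).
acc <- Reduce(union, my_vectors, init = c(), accumulate = TRUE)

# Elements of each vector that were not seen in any earlier vector.
new_elements <- lapply(seq_along(my_vectors),
                       function(i) setdiff(my_vectors[[i]], acc[[i]]))

# Carry the list names over so each result is labelled with its source.
names(new_elements) <- names(my_vectors)

# Vectors that add nothing beyond the vectors that precede them.
contributes_nothing <- names(Filter(function(x) length(x) == 0, new_elements))
contributes_nothing
```

Note that A is not reported as contributing nothing, because it comes first in the list and everything in it is new at that point.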

Here is a tidyverse solution.

lag(accumulate(l, union)) keeps track of all the elements seen so far. The difference between each vector in the original list and this running set yields the newly seen elements.

library(tidyverse)

l <- lst(A, B, C, D)

map2(l, lag(accumulate(l, union)), setdiff)
#> $A
#> [1] 1 2 3 4 5
#> 
#> $B
#> [1] 6
#> 
#> $C
#> [1] 7 8 9
#> 
#> $D
#> numeric(0)

Here is an answer to your other question about finding which vectors are subsets of other, bigger vectors.

expand_grid gets all pairwise combinations of the vectors. Filter these to find which vectors are subsets of another vector.

l %>%
  enframe() %>%
  expand_grid(a = ., b = .) %>%
  filter(
    a$name != b$name,
    map2_lgl(a$value, b$value, ~ all(.x %in% .y))
  ) %>%
  transmute(this_vector = a$name, is_a_subset_of_this_vector = b$name)
#> # A tibble: 2 x 2
#>   this_vector is_a_subset_of_this_vector
#>   <chr>       <chr>
#> 1 A           B
#> 2 D           C
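For readers who are not using the tidyverse, the same pairwise check can be sketched in base R. This is an alternative formulation and not the code from the answer above; the names is_subset and subsets are illustrative.

```r
my_vectors <- list(
  A = c(1, 2, 3, 4, 5),
  B = c(1, 2, 3, 4, 5, 6),
  C = c(5, 6, 7, 8, 9),
  D = c(8, 9)
)

# is_subset[a, b] is TRUE when every element of vector a also occurs in b.
is_subset <- sapply(my_vectors, function(b) {
  sapply(my_vectors, function(a) all(a %in% b))
})

# Every vector is trivially a subset of itself, so ignore the diagonal.
diag(is_subset) <- FALSE

# Turn the TRUE cells into a two-column table of (subset, superset) names.
idx <- which(is_subset, arr.ind = TRUE)
subsets <- data.frame(
  this_vector    = rownames(is_subset)[idx[, "row"]],
  is_a_subset_of = colnames(is_subset)[idx[, "col"]],
  stringsAsFactors = FALSE
)
subsets
```

Both versions compare every pair of vectors, so the work grows quadratically with the number of vectors. Two identical vectors would be reported as subsets of each other.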

Here you only have one truly unique element, which is 7 in C. The code below returns the unique elements as well as the vectors they belong to:

mylist <- list("A"=A,"B"=B,"C"=C,"D"=D) #better for 100's of vectors
myres <- !unlist(lapply(1:length(mylist), function(x) unlist(mylist[x]) %in% unlist(mylist[-x])))
result <- as.numeric(unlist(mylist)[myres])
member <- sapply(mylist, function(x) result %in% x)
membername <- names(mylist[member])
result
#> [1] 7
membername
#> [1] "C"
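With only one truly unique element the membership lookup above is straightforward. When several elements qualify, a long-format table keeps each value next to its source. The following is a sketch of that idea under the same definition of "truly unique" (present in exactly one vector); the names long, shared and singles are illustrative.

```r
mylist <- list(A = c(1, 2, 3, 4, 5),
               B = c(1, 2, 3, 4, 5, 6),
               C = c(5, 6, 7, 8, 9),
               D = c(8, 9))

# De-duplicate within each vector first, then stack into a data frame with
# one row per (value, source vector) pair; `ind` holds the vector name.
long <- stack(lapply(mylist, unique))

# Any value that shows up more than once now occurs in at least two vectors.
shared <- long$values[duplicated(long$values)]

# Keep the rows whose value occurs in exactly one vector.
singles <- long[!long$values %in% shared, ]
singles
```

Each row of singles pairs a value with the only vector that contains it, so no separate membership step is needed.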
