How to return the unique elements between vectors while retaining the source vector of these unique elements?
For example, I have these vectors in a list:
A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5,6)
C <- c(5,6,7,8,9)
D <- c(8,9)
In reality I have hundreds of these vectors; I only give a few here for reproducibility. My goal is the following:
Vector A shouldn't return anything because all of its elements are part of vector B. Vector B, however, does contribute an extra unique element, namely 6.
Vector C should give me 7,8,9 since c(5,6) were already included in vector B.
Vector D should return nothing because all of its elements are part of C.
In other words, vector D is a subset of C and vector A is a subset of vector B.
So far the only solution I've found was:
Reduce(setdiff, list("my_vectors"))
But it doesn't allow me to recognize which element is unique to which vector. For example, Reduce(setdiff, list(B,A)) would return 6, but I would have no idea where the 6 came from (A or B).
My difficulty is that this is a large-scale problem: I don't have just a handful of vectors, I have hundreds of them, so I can't figure out a sustainable solution. Any tips are appreciated.
Edit: my vectors are in a list.
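One detail worth checking with the Reduce(setdiff, ...) idea is that base R's setdiff is not symmetric: setdiff(x, y) keeps only the elements of x that are missing from y, so the order of the vectors in the list matters. A minimal sketch with the vectors A and B from the question (the names a_minus_b and b_minus_a are only illustrative):

```r
A <- c(1, 2, 3, 4, 5)
B <- c(1, 2, 3, 4, 5, 6)

# setdiff(x, y) returns the elements of x that are not in y
a_minus_b <- Reduce(setdiff, list(A, B))  # same as setdiff(A, B)
b_minus_a <- Reduce(setdiff, list(B, A))  # same as setdiff(B, A)

print(a_minus_b)  # empty: A has nothing that B lacks
print(b_minus_a)  # the single value 6
```

With more than two vectors, Reduce keeps subtracting each later vector from the first one, so the result only ever describes the first vector in the list.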
A first naive approach would be a for-loop, just to have a working solution. The function returns a list with the unique elements and a data frame describing which vector in vectorList each unique element comes from (first appearance).
A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5,6)
C <- c(5,6,7,8,9)
D <- c(8,9)
vectorList <- list(A,B,C,D)
ff <- function(vectorList) {
  # Start with the distinct values of the first vector
  uniques <- unique(vectorList[[1]])
  comingFromDf <- data.frame(values=uniques)
  comingFromDf$source <- 1
  # seq_along(...)[-1] is empty for a one-element list, unlike 2:length(...)
  for(k in seq_along(vectorList)[-1]) {
    vec <- vectorList[[k]]
    # Values of vector k that no earlier vector contained
    newUniques <- vec[!(vec %in% uniques)]
    if(length(newUniques)) {
      newUniques <- unique(newUniques)
      toAdd <- data.frame(values=newUniques)
      toAdd$source <- k
      comingFromDf <- rbind(comingFromDf,toAdd)
      uniques <- c(uniques,newUniques)
    }
  }
  list(uniqueElements = uniques,
       comingFromInfo = comingFromDf)
}
ff(vectorList)
I don't know how performant you need the function to be, but even with 200 vectors of length 1000 it seems to be quite fast (I don't know your dimensions):
bigVectorList <- lapply(1:200, function(k) {
  sample(1:1e6,1000)
})
microbenchmark::microbenchmark(ff(bigVectorList),times=10)
#Unit: milliseconds
# expr min lq mean median uq max neval
#ff(bigVectorList) 619.5148 624.8351 639.7535 633.2326 647.118 685.0387 10
On my machine, it took a bit more than half a second; maybe that's enough for you. Since the function only involves vectors and a data frame, it would be quite easy to re-implement it in C++ using Rcpp. This should be much faster than the for-loop implementation in R. Moreover, you can consider using the accumulate argument of the Reduce function to save the intermediate results.
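Before reaching for Rcpp, a loop-free variant in plain R may be worth trying. The sketch below is not the function from the answer above but a swapped-in technique: it stacks all values into one vector, records which list position each value came from, and keeps only first appearances with duplicated(). The helper name first_seen and the column layout are assumptions chosen to mirror comingFromInfo.

```r
# Hypothetical helper: first appearance of every value and its source index
first_seen <- function(vector_list) {
  values <- unlist(vector_list, use.names = FALSE)
  # Position in the list of the vector each stacked value came from
  src <- rep(seq_along(vector_list), lengths(vector_list))
  # duplicated() flags every occurrence after the first one
  keep <- !duplicated(values)
  data.frame(values = values[keep], source = src[keep])
}

vectorList <- list(c(1,2,3,4,5), c(1,2,3,4,5,6), c(5,6,7,8,9), c(8,9))
res <- first_seen(vectorList)
print(res)
```

For the four example vectors this reports 1 to 5 as coming from vector 1, 6 from vector 2 and 7, 8, 9 from vector 3. Whether it is faster than the loop for a given input is something to benchmark rather than assume.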
Assume your data is stored like this:
my_vectors <- list(
  A = c(1,2,3,4,5),
  B = c(1,2,3,4,5,6),
  C = c(5,6,7,8,9),
  D = c(8,9)
)
If you pass accumulate = TRUE to the call of Reduce, you get every intermediate result as well. We can use this together with union to build up the total set step by step (note that I set init = c() to make sure we start empty):
acc <- Reduce(union, my_vectors, init = c(), accumulate = TRUE)
Then, we can take the setdiff of every item with the matching entry of this built-up list.
lapply(1:length(my_vectors), function(i) setdiff(my_vectors[[i]], acc[[i]]))
This gives
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 6
[[3]]
[1] 7 8 9
[[4]]
numeric(0)
You can apply the names of my_vectors later if you want.
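Carrying the names over could look like the sketch below. It repeats the two steps above so that it runs on its own; the variable names new_elements and contributors are illustrative, and the final filtering step is an optional extra that is not part of the answer.

```r
my_vectors <- list(
  A = c(1, 2, 3, 4, 5),
  B = c(1, 2, 3, 4, 5, 6),
  C = c(5, 6, 7, 8, 9),
  D = c(8, 9)
)

# acc[[i]] is the union of all vectors that come before my_vectors[[i]]
acc <- Reduce(union, my_vectors, init = c(), accumulate = TRUE)

new_elements <- lapply(seq_along(my_vectors), function(i) {
  setdiff(my_vectors[[i]], acc[[i]])
})

# Reattach the original names so each result is labelled by its source
names(new_elements) <- names(my_vectors)

# Optional: keep only the vectors that contribute something new
contributors <- new_elements[lengths(new_elements) > 0]
print(contributors)
```

Note that the first vector always contributes all of its elements here, because nothing has been seen before it.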
Here is a tidyverse solution. lag(accumulate(l, union)) keeps track of all the elements seen before each vector. The difference between the original list and this yields the newly seen elements.
library(tidyverse)
l <- lst(A, B, C, D)
map2(l, lag(accumulate(l, union)), setdiff)
#> $A
#> [1] 1 2 3 4 5
#>
#> $B
#> [1] 6
#>
#> $C
#> [1] 7 8 9
#>
#> $D
#> numeric(0)
expand_grid will get all combinations of the vectors. Filter this to find which vector is a subset of any other vector.
l %>%
  enframe() %>%
  expand_grid(a = ., b = .) %>%
  filter(
    a$name != b$name,
    map2_lgl(a$value, b$value, ~ all(.x %in% .y))
  ) %>%
  transmute(this_vector = a$name, is_a_subset_of_this_vector = b$name)
#> # A tibble: 2 x 2
#>   this_vector is_a_subset_of_this_vector
#>   <chr>       <chr>
#> 1 A           B
#> 2 D           C
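If the tidyverse is not available, the same pairwise subset check can be sketched in base R. This is an assumed equivalent rather than part of the answer: it builds a logical matrix with one row and one column per vector and then lists the TRUE cells. The names subset_matrix and subset_pairs are illustrative.

```r
l <- list(A = c(1, 2, 3, 4, 5), B = c(1, 2, 3, 4, 5, 6),
          C = c(5, 6, 7, 8, 9), D = c(8, 9))

# subset_matrix[i, j] is TRUE when vector i is contained in a different vector j
subset_matrix <- sapply(names(l), function(j) {
  sapply(names(l), function(i) i != j && all(l[[i]] %in% l[[j]]))
})

# Turn the TRUE cells into (subset, superset) pairs
hits <- which(subset_matrix, arr.ind = TRUE)
subset_pairs <- data.frame(
  this_vector = rownames(subset_matrix)[hits[, "row"]],
  is_a_subset_of_this_vector = colnames(subset_matrix)[hits[, "col"]],
  stringsAsFactors = FALSE
)
print(subset_pairs)
```

Like the expand_grid version, this compares every pair of vectors, so the work grows with the square of the number of vectors, and two identical vectors would be reported as subsets of each other.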
Here you only have one truly unique element, which is 7 in C. The code below returns the unique elements as well as their memberships:
mylist <- list("A"=A,"B"=B,"C"=C,"D"=D) #better for 100's of vectors
myres <- !unlist(lapply(1:length(mylist), function(x) unlist(mylist[x]) %in% unlist(mylist[-x])))
result <- as.numeric(unlist(mylist)[myres])
member <- sapply(mylist, function(x) result %in% x)
membername <- names(mylist[member])
result
membername
> result
[1] 7
> membername
[1] "C"
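The membername step above relies on result having length one; with several exclusive values, sapply returns a matrix and the indexing may no longer behave as intended. A possible variant that pairs every exclusive value with its vector is sketched below. It is an alternative technique, not the answer's code, and the names deduped, in_one_vector and exclusive are assumptions.

```r
mylist <- list(A = c(1, 2, 3, 4, 5), B = c(1, 2, 3, 4, 5, 6),
               C = c(5, 6, 7, 8, 9), D = c(8, 9))

# De-duplicate within each vector so repeats inside one vector do not count
deduped <- lapply(mylist, unique)
values <- unlist(deduped, use.names = FALSE)
member <- rep(names(deduped), lengths(deduped))

# Keep a value only if it occurs in exactly one vector
in_one_vector <- !(values %in% values[duplicated(values)])
exclusive <- data.frame(value = values[in_one_vector],
                        member = member[in_one_vector],
                        stringsAsFactors = FALSE)
print(exclusive)
```

For the example data this yields a single row, the value 7 with member C, which matches the output above.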