How to get the proportion of elements that match between two vectors?

Question

I have two large vectors, for example:

set.seed(17)
vec1 <- paste0(sample(1:10, 10000000, replace = TRUE), "_", sample(1:1000000000, 10000000))
vec2 <- paste0(sample(1:10, 1000000, replace = TRUE), "_", sample(1:1000000000, 1000000))

I need to find the proportion of elements in vec2 that are also in vec1. I am currently using:

system.time({
  prop <- table(vec2 %in% vec1)[[2]] / length(vec2)
})

However, the actual vectors I am applying this to are VERY large (up to ~2,000,000,000 elements), so performance is very important. Is anyone able to suggest how I can decrease the run-time?

Answer 1

Here are some options with timings, including the suggestions from @Sotos and @Henrik in the comments for comparison.

library(microbenchmark)
library(data.table)

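microbenchmark(a1 = table(vec2 %in% vec1)[[2]] / length(vec2),     # approach from the question
               a2 = sum(vec2 %in% vec1) / length(vec2),
               a3 = sum(!is.na(match(vec2, vec1))) / length(vec2),
               a4 = length(intersect(vec2, vec1)) / length(vec2),  # intersect() counts each distinct match once
               a5 = sum(vec2 %chin% vec1) / length(vec2))          # data.table's %in% for character vectors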

#Unit: milliseconds
# expr     min       lq     mean   median       uq      max neval
#   a1 1269.84 1340.468 1667.251 1410.252 2191.750 2535.723   100
#   a2 1022.26 1086.938 1284.692 1124.565 1152.516 2286.028   100
#   a3 1023.59 1125.517 1387.592 1148.337 1852.645 3849.555   100
#   a4 1022.84 1088.056 1291.582 1122.846 1173.768 2277.901   100
#   a5  449.19  453.146  462.781  454.365  458.178  620.996   100

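Clearly, Henrik's solution (a5, using data.table's %chin%) is the fastest: its median is about 454 ms, against roughly 1,120 ms or more for every other option.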
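
As a quick sanity check on the expressions benchmarked above, here is a minimal sketch on small made-up vectors (the values are hypothetical and not taken from the question). It suggests that the %in%, match and %chin% variants agree, that intersect() gives a different answer as soon as vec2 contains duplicates, and that table(...)[[2]] cannot be used when every element matches. The %chin% line only runs if data.table is installed.

```r
# Toy data (hypothetical values); vec2 deliberately contains a duplicate.
vec1 <- c("1_10", "2_20", "3_30", "4_40")
vec2 <- c("1_10", "1_10", "9_99", "3_30")

# mean() of a logical vector is the share of TRUE values,
# i.e. the proportion of vec2 elements found in vec1.
prop_in    <- mean(vec2 %in% vec1)                          # 0.75
prop_match <- sum(!is.na(match(vec2, vec1))) / length(vec2) # 0.75

# intersect() returns distinct values, so "1_10" is counted only once.
prop_intersect <- length(intersect(vec2, vec1)) / length(vec2) # 0.5

# %chin% is data.table's counterpart of %in% for character vectors.
prop_chin <- if (requireNamespace("data.table", quietly = TRUE)) {
  mean(data.table::`%chin%`(vec2, vec1))
} else {
  NA_real_
}

# When every element matches (or none does), table() has a single cell,
# so indexing it with [[2]] would fail with "subscript out of bounds".
all_match <- c("1_10", "2_20")
n_cells <- length(table(all_match %in% vec1)) # 1
```

If vec2 may contain repeated values, or if all or none of its elements might match, sum() or mean() over the logical vector is the safer way to compute the proportion.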

data

set.seed(17)
vec1 <- paste0(sample(1:10, 10000000, replace = TRUE), "_", 
               sample(1:1000000000, 10000000))
vec2 <- paste0(sample(1:10, 1000000, replace = TRUE), "_", 
               sample(1:1000000000, 1000000))
