I have two large vectors, for example:
set.seed(17)
vec1 <- paste0(sample(1:10, 10000000, replace = T), "_", sample(1:1000000000, 10000000))
vec2 <- paste0(sample(1:10, 1000000, replace = T), "_", sample(1:1000000000, 1000000))
I need to find the proportion of elements in vec2 that are also in vec1. I am currently using:
system.time({
prop <- table(vec2 %in% vec1)[[2]]/length(vec2)
})
However, the actual vectors I am applying this to are very large (up to ~2,000,000,000 elements), so performance matters a great deal. Can anyone suggest how to reduce the run time?
Here are some options with timings, including the suggestions from @Sotos and @Henrik in the comments for comparison.
library(microbenchmark)
library(data.table)
microbenchmark(
  a1 = table(vec2 %in% vec1)[[2]] / length(vec2),
  a2 = sum(vec2 %in% vec1) / length(vec2),
  a3 = sum(!is.na(match(vec2, vec1))) / length(vec2),
  a4 = length(intersect(vec2, vec1)) / length(vec2),
  a5 = sum(vec2 %chin% vec1) / length(vec2)
)
# Unit: milliseconds
#  expr     min       lq     mean   median       uq      max neval
#    a1 1269.84 1340.468 1667.251 1410.252 2191.750 2535.723   100
#    a2 1022.26 1086.938 1284.692 1124.565 1152.516 2286.028   100
#    a3 1023.59 1125.517 1387.592 1148.337 1852.645 3849.555   100
#    a4 1022.84 1088.056 1291.582 1122.846 1173.768 2277.901   100
#    a5  449.19  453.146  462.781  454.365  458.178  620.996   100
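Clearly, Henrik's solution, data.table's %chin% (a5), is the fastest: its median of about 454 ms is roughly 2.5 times faster than the other options, which all sit between about 1,120 and 1,410 ms.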
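The benchmarked expressions are not all interchangeable in every situation, so a quick check on toy data can help before picking one. The sketch below uses base R only, and the small vectors and the names `prop_in`, `prop_match`, `prop_intersect`, `all_hit` and `res` are made up for illustration. It assumes that duplicates in vec2 should each count towards the proportion.

```r
# Toy vectors (hypothetical data, small enough to check by hand)
vec1 <- c("1_10", "2_20", "3_30", "4_40")
vec2 <- c("1_10", "1_10", "5_50", "3_30")  # note the duplicated "1_10"

# mean() of a logical vector equals sum(x) / length(x): 3 of 4 elements match
prop_in    <- mean(vec2 %in% vec1)
prop_match <- mean(!is.na(match(vec2, vec1)))

# intersect() drops duplicates, so "1_10" is counted once: 2 of 4
prop_intersect <- length(intersect(vec2, vec1)) / length(vec2)

print(c(prop_in = prop_in, prop_match = prop_match,
        prop_intersect = prop_intersect))

# table(...)[[2]] assumes both FALSE and TRUE occur; with only one of them
# the table has a single cell and [[2]] raises an error, caught here as NA
all_hit <- c("1_10", "2_20")
res <- tryCatch(table(all_hit %in% vec1)[[2]], error = function(e) NA)
print(res)
```

With the data in the question every element of vec2 is unique, so a4 agrees with the others there. If your real vectors can contain duplicates, or can match entirely or not at all, the `%in%`, `match` or `%chin%` forms are the safer choice.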
Data
set.seed(17)
vec1 <- paste0(sample(1:10, 10000000, replace = T), "_",
sample(1:1000000000, 10000000))
vec2 <- paste0(sample(1:10, 1000000, replace = T), "_",
sample(1:1000000000, 1000000))