I am in need of some wisdom!
I have two data frames, like:
test1 <- data.frame( let = replicate( 100, paste(sample(LETTERS[1:12] , 3 ) , collapse ="") ) , num = sample( 1:500 , 100 , replace = FALSE ))
test2 <- data.frame( let = replicate( 100, paste(sample(LETTERS[13:26] , 4 ) , collapse ="") ) , num = sample( 1:500 , 100 , replace = FALSE ))
head( test1 )
# let num
# 1 KDA 430
# 2 IHB 41
# 3 GAB 473
# 4 HKJ 335
# 5 LCK 261
# 6 EIK 500
head( test2 )
# let num
# 1 ZUYW 153
# 2 PRNW 263
# 3 OTQS 355
# 4 NYRW 87
# 5 ZYST 365
# 6 TXRN 287
Now, I want to paste all combinations of strings from test1 (ie test1$let) with all combinations of strings from test2, but only when difference test1$num and test2$num is <= 100.
One way to do this is:
test.merg <- NULL
i <- 1; j <- 1
for(i in 1:dim(test1)[1] ) {
for( j in 1:dim(test2)[1] ) {
if( abs( test1[i,]$num - test2[j,]$num ) <= 100 ){
test.merg <- c(test.merg ,paste( test1[i,]$let , test2[j,]$let , sep="." ) )
}
j <- j+ 1
}
i <- i+ 1
}
head(test.merg)
#[1] "KDA.OTQS" "KDA.ZYST" "KDA.TVRX" "KDA.VYRQ" "KDA.XRQS" "KDA.WSUR"
This works fine, but of course my actual dataset is different and huge, and it takes a long time to do this. I am sure that there must be a more efficient way of doing this. Tried using the apply family functions, but the only way I could think of using them is:
test1.1 <- paste( test1$let , test1$num ,sep = "_")
test2.1 <- paste( test2$let , test2$num ,sep = "_")
test.merg.1 <- unlist(lapply( test1.1 , FUN = function(x) {lapply(
test2.1 , FUN = function(y) {
if( abs( as.numeric( str_split_fixed( x , "_" , 2 )[,2] ) - as.numeric( str_split_fixed( y , "_" , 2 )[,2]) ) <= 100){
paste( str_split_fixed(x , "_" , 2 )[,1] , str_split_fixed(y , "_" , 2 )[,1], sep = ".")
}
})
})
)
head(test.merg.1)
# [1] "KDA.OTQS" "KDA.ZYST" "KDA.TVRX" "KDA.VYRQ" "KDA.XRQS" "KDA.WSUR"
This already reduces the time taken by quite a lot, to almost 1/4th , but it would be great if it can be made more efficient. Not to mention, if there is a completely different and better way of doing this then it will be fantastic.
Thank you!
something like this ?
Note: if your dataset is really "huge" as you say, your computer will not like that, but if you want every possible combination I don't see any other way.
res <- merge(test1 %>% rename_all(paste0,1),
test2 %>% rename_all(paste0,2)) %>%
filter(abs(num1-num2) <= 100) %>%
mutate(str = paste(let1,let2,sep="_"))
# let1 num1 let2 num2 str
# 1 DJE 82 VNQU 181 DJE_VNQU
# 2 JLE 238 VNQU 181 JLE_VNQU
# 3 EGI 220 VNQU 181 EGI_VNQU
# 4 KED 130 VNQU 181 KED_VNQU
# 5 CJF 81 VNQU 181 CJF_VNQU
# 6 KCH 235 VNQU 181 KCH_VNQU
# ...
head(res$str)
#[1] "DJE_VNQU" "JLE_VNQU" "EGI_VNQU" "KED_VNQU" "CJF_VNQU" "KCH_VNQU"
A combination of outer
statements work here
outer(test1$let, test2$let, "paste", sep=".")[abs(outer(test1$num, test2$num, "-")) <= 100]
# [1] "DEF.VOXZ" "FHJ.VOXZ" "CHB.VOXZ" "JBH.VOXZ" etc
Reproducible data
set.seed(1)
test1 <- data.frame( let = replicate( 100, paste(sample(LETTERS[1:12] , 3 ) , collapse ="") ) , num = sample( 1:500 , 100 , replace = FALSE ))
test2 <- data.frame( let = replicate( 100, paste(sample(LETTERS[13:26] , 4 ) , collapse ="") ) , num = sample( 1:500 , 100 , replace = FALSE ))
Benchmark
OP <- function() {
test.merg <- NULL
i <- 1; j <- 1
for(i in 1:dim(test1)[1] ) {
for( j in 1:dim(test2)[1] ) {
if( abs( test1[i,]$num - test2[j,]$num ) <= 100 ){
test.merg <- c(test.merg ,paste( test1[i,]$let , test2[j,]$let , sep="." ) )
}
j <- j+ 1
}
i <- i+ 1
}
head(test.merg)
}
myfun <- function() {
outer(test1$let, test2$let, "paste", sep=".")[abs(outer(test1$num, test2$num, "-")) <= 100]
}
library(microbenchmark)
microbenchmark(OP(), myfun(), times=10L)
Unit: milliseconds
expr min lq mean median uq max neval
OP() 4877.0017 4928.447303 5014.859718 5017.653519 5056.110679 5236.55990 10
myfun() 5.8398 5.951762 8.501438 6.709145 7.842536 25.16273 10
It's almost 500x faster
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.