简体   繁体   中英

Pasting together all combinations of a column from one data.frame with all combinations of a column of another data.frame based on a condition

I am in need of some wisdom!

I have two data frames, like:

test1 <- data.frame( let = replicate( 100, paste(sample(LETTERS[1:12] , 3 ) , collapse ="")  ) , num  = sample( 1:500 , 100 , replace = FALSE ))
test2 <- data.frame( let = replicate( 100, paste(sample(LETTERS[13:26] , 4 ) , collapse ="")  ) , num  = sample( 1:500 , 100 , replace = FALSE ))

head( test1 )
#   let num
# 1 KDA 430
# 2 IHB  41
# 3 GAB 473
# 4 HKJ 335
# 5 LCK 261
# 6 EIK 500

head( test2 )
#   let num
# 1 ZUYW 153
# 2 PRNW 263
# 3 OTQS 355
# 4 NYRW  87
# 5 ZYST 365
# 6 TXRN 287

Now, I want to paste all combinations of strings from test1 (ie test1$let) with all combinations of strings from test2, but only when difference test1$num and test2$num is <= 100.

One way to do this is:

test.merg <- NULL
i <- 1; j <- 1
for(i in 1:dim(test1)[1] ) {
  for( j in 1:dim(test2)[1]  ) { 
    if( abs( test1[i,]$num - test2[j,]$num ) <= 100  ){
      test.merg <- c(test.merg ,paste( test1[i,]$let , test2[j,]$let , sep="."   ) )
      }
    j <- j+ 1
    }
  i <- i+ 1
}
head(test.merg)
#[1] "KDA.OTQS" "KDA.ZYST" "KDA.TVRX" "KDA.VYRQ" "KDA.XRQS" "KDA.WSUR"

This works fine, but of course my actual dataset is different and huge, and it takes a long time to do this. I am sure that there must be a more efficient way of doing this. Tried using the apply family functions, but the only way I could think of using them is:

test1.1 <- paste( test1$let , test1$num ,sep = "_")
test2.1 <- paste( test2$let , test2$num ,sep = "_")

test.merg.1 <- unlist(lapply( test1.1 , FUN = function(x) {lapply( 
  test2.1 , FUN = function(y) {
    if( abs( as.numeric( str_split_fixed( x , "_" , 2 )[,2] )  - as.numeric( str_split_fixed( y , "_" , 2 )[,2]) ) <= 100){ 
      paste( str_split_fixed(x , "_" , 2 )[,1] , str_split_fixed(y , "_" , 2 )[,1], sep = ".")
  }
})
})
)

head(test.merg.1)
# [1] "KDA.OTQS" "KDA.ZYST" "KDA.TVRX" "KDA.VYRQ" "KDA.XRQS" "KDA.WSUR"

This already reduces the time taken by quite a lot, to almost 1/4th , but it would be great if it can be made more efficient. Not to mention, if there is a completely different and better way of doing this then it will be fantastic.

Thank you!

something like this ?

Note: if your dataset is really "huge" as you say, your computer will not like that, but if you want every possible combination I don't see any other way.

res <- merge(test1 %>% rename_all(paste0,1),
             test2 %>% rename_all(paste0,2)) %>%
  filter(abs(num1-num2) <= 100) %>%
  mutate(str = paste(let1,let2,sep="_"))
#    let1 num1 let2 num2      str
# 1  DJE   82 VNQU  181 DJE_VNQU
# 2  JLE  238 VNQU  181 JLE_VNQU
# 3  EGI  220 VNQU  181 EGI_VNQU
# 4  KED  130 VNQU  181 KED_VNQU
# 5  CJF   81 VNQU  181 CJF_VNQU
# 6  KCH  235 VNQU  181 KCH_VNQU
# ...

head(res$str)
#[1] "DJE_VNQU" "JLE_VNQU" "EGI_VNQU" "KED_VNQU" "CJF_VNQU" "KCH_VNQU"

A combination of outer statements work here

outer(test1$let, test2$let, "paste", sep=".")[abs(outer(test1$num, test2$num, "-")) <= 100]

# [1] "DEF.VOXZ" "FHJ.VOXZ" "CHB.VOXZ" "JBH.VOXZ" etc

Reproducible data

set.seed(1)
test1 <- data.frame( let = replicate( 100, paste(sample(LETTERS[1:12] , 3 ) , collapse ="")  ) , num  = sample( 1:500 , 100 , replace = FALSE ))
test2 <- data.frame( let = replicate( 100, paste(sample(LETTERS[13:26] , 4 ) , collapse ="")  ) , num  = sample( 1:500 , 100 , replace = FALSE ))

Benchmark

OP <- function() {
test.merg <- NULL
i <- 1; j <- 1
for(i in 1:dim(test1)[1] ) {
  for( j in 1:dim(test2)[1]  ) { 
    if( abs( test1[i,]$num - test2[j,]$num ) <= 100  ){
      test.merg <- c(test.merg ,paste( test1[i,]$let , test2[j,]$let , sep="."   ) )
      }
    j <- j+ 1
    }
  i <- i+ 1
}
head(test.merg)
}

myfun <- function() {
outer(test1$let, test2$let, "paste", sep=".")[abs(outer(test1$num, test2$num, "-")) <= 100]
}

library(microbenchmark)
microbenchmark(OP(), myfun(), times=10L)

Unit: milliseconds
    expr       min          lq        mean      median          uq        max neval
    OP() 4877.0017 4928.447303 5014.859718 5017.653519 5056.110679 5236.55990    10
 myfun()    5.8398    5.951762    8.501438    6.709145    7.842536   25.16273    10

It's almost 500x faster

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM