简体   繁体   中英

Faster way to achieve filtering for all possible combinations

Consider I have a data frame like this,

set.seed(1)

q<-100

df <- data.frame(Var1 = round(runif(q,1,50)),
                    Var2 = round(runif(q,1,50)),
                        Var3 = round(runif(q,1,50)),
                            Var4 = round(runif(q,1,50)))
attach(df)

As you realized, q is standing for setting the length of the each columns in the dataframe.

I want to make a filtering of all possible combinations of the columns. It can be anything. Let's say I am seeking for if the devision of the sums of the first two columns and the sums of the last two columns greater than 1 or not.

One thing to achieve that, using expand.grid() function.

a <- Sys.time()

expanded <- expand.grid(Var1, Var2, Var3, Var4)

Sys.time() - a

Time difference of 8.31997 secs


expanded  <- expanded[rowSums(expanded[,1:2])/ rowSums(expanded[,3:4])>1,]

However it takes a lot time! To make it faster, I tried to follow the answer with rep.int() function in this question and designed my own function.

myexpand <- function(...) {

 sapply(list(...),function(y) rep.int(y, prod(lengths(list(...)))/length(y)))

}

But it is not so promising again. It takes more time comparing to my expectation and the expand.grid also.And, If I set a greater q , it becomes a nigthmare!

Is there a proper way to achieve this a lot faster (1-2 seconds) with maybe matrix operations before applying expand.grid or myexpand . And, I wonder if it is a weakness of using an interpreted language like R... Software suggestions are also acceptable.

For this particular condition (ie ratio of sums > 1), you might want to consider using the data.table package:

system.time({
    #generate permutations of Var1 & Var2 and Var3 & Var4
    DT12 <- DT[, CJ(Var1=Var1, Var2=Var2, unique=TRUE)][, s12 := Var1 + Var2]
    DT34 <- DT[, CJ(Var3=Var3, Var4=Var4, unique=TRUE)][, s34 := Var3 + Var4]

    #perform a non-equi join
    DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE,
        .(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4), by=.EACHI][, s12:=NULL]
})

timing:

   user  system elapsed 
   0.02    0.06    0.08 

output:

         Var1 Var2 Var3 Var4
      1:    2    5    2    4
      2:    4    3    2    4
      3:    5    2    2    4
      4:    2    6    2    4
      5:    4    4    2    4
     ---                    
1753416:   50   49   49   48
1753417:   50   50   49   48
1753418:   50   49   49   49
1753419:   50   50   49   49
1753420:   50   50   49   50

data:

library(data.table)
set.seed(1)
q <- 100
DT <- data.table(Var1 = round(runif(q,1,50)),
    Var2 = round(runif(q,1,50)),
    Var3 = round(runif(q,1,50)),
    Var4 = round(runif(q,1,50)))

edit: For summing of positive numbers, you can prob use the following (caveat: it will not be faster than using a Rcpp approach).

system.time({
    S <- DT[, .(UB=90 - Var1, C1=Var1)]
    for (k in 2:4) {
        S <- DT[S, on=paste0("Var", k, "<UB"), allow.cartesian=TRUE,
            mget(c(names(S), paste0("x.Var", k)))]
        setnames(S, paste0("x.Var", k), paste0("C", k))
        S[, UB := UB - get(paste0("C",k))]
    }
    S[, UB := NULL][rowSums(S)>30L]
})

timing:

   user  system elapsed 
   3.48    4.06    3.51 

output, S :

> S
          C1 C2 C3 C4
       1: 14 33 14  6
       2: 14 33 14 25
       3: 14 33 14 24
       4: 14 33 14 19
       5: 14 33 14 10
      ---            
34914725: 31 39  3  8
34914726: 31 39  3  8
34914727: 31 39  3  9
34914728: 31 39  3 16
34914729: 31 39  3  8

How about using tidyr and dplyr ?

library(tidyr)
library(dplyr)

expanded <- 
  df %>% 
  expand(Var1, Var2, Var3, Var4)
  mutate(res = (Var1 + Var2)/(Var3 + Var4)) %>%
  filter(res > 1)

Note that expand works slightly differently from expand.grid - it only keeps the unique rows - so this may not be what you need.

On my computer, this all takes less than a second.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM