
Need a more efficient threshold-matching function for R

Not sure how best to ask this question, so feel free to edit the question title if there is a more standard vocabulary to use here.

I have two 2-column data frames in R. The first (u) is a list of unique 2-variable values, so it is much shorter than the second (d), which is a raw list of similar values. I need a function that will, for every pair of values in u, find all the pairs of values in d for which both variables are within a given threshold.

Here's a minimal example. Actual data is much larger (see below, as this is the problem) and (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.

# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))

# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100,
                PC2=sample(u$PC2,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100)

# Set the threshold that defines a 'close-enough' match between u and d values
b <- 0.1

So, my first attempt to do this was with a for loop for all values of u. This works nicely, but is computationally intensive and takes quite a while to process the actual data.

# Make a list to hold the within-threshold rows
m <- list()
# Loop to find all values of d within a threshold b of each value of u
# The output list will have as many items as values of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
  for(i in 1:nrow(u)){
      m <- c(m, list(which(abs(d$PC1-u$PC1[i])<b & abs(d$PC2-u$PC2[i])<b)))
  } 
})
m
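
One detail of the loop above is that m is grown with c() on every iteration, which copies the list each time. A minimal sketch that preallocates the list instead is shown below, on toy data made up for illustration. Scanning all of d for every row of u is probably still the dominant cost, so this is a tidy-up rather than a fix.

```r
# Toy data (made up for illustration, not the real data set)
u <- data.frame(PC1 = c(0, 5, 10), PC2 = c(0, 5, 10))
d <- data.frame(PC1 = c(0.05, 4.95, 0.5, 5.02),
                PC2 = c(-0.05, 5.05, 0, 5.01))
b <- 0.1

# Preallocate one list slot per row of u instead of growing m with c()
m <- vector("list", nrow(u))
for (i in seq_len(nrow(u))) {
  # Same test as in the question: both coordinates within b
  m[[i]] <- which(abs(d$PC1 - u$PC1[i]) < b & abs(d$PC2 - u$PC2[i]) < b)
}
m
```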

That works. But I thought using a function with apply() would be more efficient. Which it is...

# Make the user-defined function for the threshold matching
match <- function(x,...){
  which(abs(d$PC1-x[1])<b & abs(d$PC2-x[2])<b)
}
# Run the function with the apply() command.
system.time({
  m <- apply(u,1,match)
})

Again, this apply function works and is slightly faster than the for loop, but only marginally. This may simply be a big data problem for which I need a bit more computing power (or more time!). But I thought others might have thoughts on a sneaky command or function syntax that would dramatically speed this up. Outside the box approaches to finding these matching rows also welcome.

Somewhat sneaky:

library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)

Should be fast once the number of rows is sufficiently large. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.
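
A related idea that needs only base R, sketched here as an assumption rather than as the method of the answer above: sort d by PC1 once, use findInterval() to locate the slice of rows whose PC1 lies within b of each u value, and test PC2 only inside that slice. The helper name match_sorted is hypothetical, and values that sit exactly on the threshold may be classified differently from the loop because of floating-point rounding.

```r
# Hypothetical helper: window search on a sorted copy of d$PC1
match_sorted <- function(u, d, b) {
  ord <- order(d$PC1)          # permutation that sorts d by PC1
  pc1 <- d$PC1[ord]
  pc2 <- d$PC2[ord]
  # findInterval(x, v) counts the elements of sorted v that are <= x,
  # so adding 1 gives the first position with pc1 > u$PC1 - b
  lo <- findInterval(u$PC1 - b, pc1) + 1L
  # with left.open = TRUE it counts the elements strictly below x,
  # i.e. the last position with pc1 < u$PC1 + b
  hi <- findInterval(u$PC1 + b, pc1, left.open = TRUE)
  lapply(seq_len(nrow(u)), function(i) {
    if (hi[i] < lo[i]) return(integer(0))   # empty PC1 window
    idx <- lo[i]:hi[i]
    keep <- abs(pc2[idx] - u$PC2[i]) < b     # PC2 test inside the window only
    sort(ord[idx[keep]])                     # map back to original row numbers of d
  })
}

# Toy data (made up for illustration)
u <- data.frame(PC1 = c(0, 5, 10), PC2 = c(0, 5, 10))
d <- data.frame(PC1 = c(0.05, 4.95, 0.5, 5.02, 0.01, 20),
                PC2 = c(-0.05, 5.05, 0, 7, 0.3, 20))
res <- match_sorted(u, d, 0.1)
```

The result has the same shape as the list m built by the loop in the question: one integer vector of d row numbers per row of u. How much this helps depends on how many rows of d fall inside each PC1 window.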

Alas, this seems only slightly faster than the for loop

unlist(Map(function(x,y) {
    which(abs(d$PC1-x)<b & abs(d$PC2-y)<b)
}, u$PC1, u$PC2))

but at least it's something.
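
Note that unlist() merges all matches into a single vector, so the link between each match and its row of u is lost. A minimal sketch that keeps that link, on toy data made up for illustration, is to leave the Map() result as a list and expand it into (u_row, d_row) pairs. The column names u_row and d_row are my own choice.

```r
# Toy data (made up for illustration)
u <- data.frame(PC1 = c(0, 5), PC2 = c(0, 5))
d <- data.frame(PC1 = c(0.05, 4.95, 5.02),
                PC2 = c(-0.05, 5.05, 5.01))
b <- 0.1

# One integer vector of matching d rows per row of u
hits <- Map(function(x, y) {
    which(abs(d$PC1 - x) < b & abs(d$PC2 - y) < b)
}, u$PC1, u$PC2)

# Long format: repeat each u row number once per match it received
pairs <- data.frame(u_row = rep(seq_along(hits), lengths(hits)),
                    d_row = as.integer(unlist(hits)))
pairs
```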

I have a cunning plan :-). How about just doing the calculations directly:

> set.seed(10)
> bar<-matrix(runif(10),ncol=2)
> bar
           [,1]      [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
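> # compare column 1 with foo[1] and column 2 with foo[2];
> # a bare foo-bar recycles foo element by element down each column instead
> thresh<-matrix(foo,nrow(bar),ncol(bar),byrow=TRUE)-bar
> sign(thresh)
     [,1] [,2]
[1,]   -1    1
[2,]   -1    1
[3,]   -1    1
[4,]   -1    1
[5,]    1    1

Now all you have to do is select the rows of that last matrix that equal c(-1,1), using which, and you can easily extract the desired rows from your bar matrix. Repeat for each pair of values that takes the place of foo.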
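
To connect the sign idea to the question's threshold, here is a minimal sketch under my own assumptions: for one pair of centre values, a row qualifies when every column lies strictly above centre - b and strictly below centre + b. The function name rows_within and the toy matrix are hypothetical.

```r
# Hypothetical helper: rows of numeric matrix dm whose columns all lie
# strictly within b of the corresponding element of centre
rows_within <- function(dm, centre, b) {
  # Repeat centre on every row so the subtraction is column by column
  centre_m <- matrix(centre, nrow(dm), ncol(dm), byrow = TRUE)
  above_lower <- sign(dm - (centre_m - b))   # +1 where value > centre - b
  below_upper <- sign((centre_m + b) - dm)   # +1 where value < centre + b
  # Keep rows where both signs are +1 in every column
  which(rowSums(above_lower == 1 & below_upper == 1) == ncol(dm))
}

# Toy data (made up for illustration)
dm <- cbind(c(0.05, 4.95, 0.5), c(-0.05, 5.05, 0))
rows_within(dm, c(0, 0), 0.1)
rows_within(dm, c(5, 5), 0.1)
```

This still does one full pass over the matrix per pair of centre values, so it restates the test in vectorised form rather than reducing the amount of work.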

