Need more efficient threshold matching with an R function

Question

I'm not sure how best to ask this question, so please feel free to edit the title if there is more standard vocabulary for it.

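I have two 2-column data tables in R. The first is a list of unique 2-variable values (u) and is much shorter than the second, which is the raw list of similar values (d). I need a function that, for every 2-variable set of values in u, finds all the 2-variable sets in d for which both variables are within a given threshold.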

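Here is a minimal example. The actual data is much larger (see below, as that is the problem) and is (obviously) not created randomly as in the example. In the actual data, u would have about 600,000 to 1,000,000 values (rows) and d would have upwards of 10,000,000 rows.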

# First create the table of unique variable pairs (no 2-column duplicates)
u <- data.frame(PC1=c(-1.10,-1.01,-1.13,-1.18,-1.12,-0.82),
                PC2=c(-1.63,-1.63,-1.81,-1.86,-1.86,-1.77))

# Now, create the set of raw 2-variable pairs, which may include duplicates
d <- data.frame(PC1=sample(u$PC1,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100,
                PC2=sample(u$PC2,100,replace=TRUE)*sample(90:100,100,replace=TRUE)/100)

# Set the threshold that defined a 'close-enough' match between u and d values
b <- 0.1

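So, my first attempt was a for loop over all the values of u. This works nicely, but it is computationally intensive and takes quite a while to process the real data.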

# Preallocate the output list, one element per row of u
m <- vector("list", nrow(u))
# Loop to find all values of d within a threshold b of each value of u
# For each list item, there may be up to several thousand matching rows in d
# Note that there's a timing command (system.time) in here to keep track of performance
system.time({
  for (i in seq_len(nrow(u))) {
    m[[i]] <- which(abs(d$PC1 - u$PC1[i]) < b & abs(d$PC2 - u$PC2[i]) < b)
  }
})
m

That works. But I thought using a function with apply() would be more efficient. And it is...

# Make the user-defined function for the threshold matching
# (named match_rows so that it does not mask base::match)
match_rows <- function(x, ...) {
  which(abs(d$PC1 - x[1]) < b & abs(d$PC2 - x[2]) < b)
}
# Run the function with the apply() command, once per row of u.
system.time({
  m <- apply(u, 1, match_rows)
})

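Again, this apply() version works and is slightly faster than the for loop, but only marginally. This may simply be a big-data problem for which I need more computing power (or more time!). But I thought others might have ideas for a sneaky command or function syntax that would dramatically speed this up. Outside-the-box approaches to finding these matching rows are also welcome.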

Answer 1

A bit sneaky:

library(IRanges)
ur <- with(u*100L, IRanges(PC2, PC1))
dr <- with(d*100L, IRanges(PC2, PC1))
hits <- findOverlaps(ur, dr + b*100L)

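This should be fast once the number of rows gets large enough. We multiply by 100 to get into integer space. Reversing the order of the arguments to findOverlaps could improve performance.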
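For readers without Bioconductor, the same "narrow down the candidates first" idea can be sketched in base R. This is not the IRanges method from this answer but a swapped-in technique: sort d by PC1 once, use findInterval() to locate the contiguous block of rows whose PC1 could be within b of each u value, and run the exact test only inside that block. The helper name window_match and the seed are assumptions made for illustration.

```r
# Hypothetical helper: window on sorted PC1, then exact check on both columns.
window_match <- function(u, d, b) {
  ord <- order(d$PC1)            # sort d once by PC1
  p1  <- d$PC1[ord]
  p2  <- d$PC2[ord]
  bw  <- 1.001 * b               # slightly wider window so rounding at the
                                 # edge cannot drop a true match
  lo <- findInterval(u$PC1 - bw, p1) + 1L   # first index with p1 >  u - bw
  hi <- findInterval(u$PC1 + bw, p1)        # last index with  p1 <= u + bw
  lapply(seq_len(nrow(u)), function(i) {
    if (lo[i] > hi[i]) return(integer(0))   # empty window, no candidates
    idx  <- lo[i]:hi[i]
    keep <- abs(p1[idx] - u$PC1[i]) < b & abs(p2[idx] - u$PC2[i]) < b
    sort(ord[idx[keep]])         # map back to the original row numbers of d
  })
}

# Example data built as in the question (the seed is an added assumption)
set.seed(1)
u <- data.frame(PC1 = c(-1.10, -1.01, -1.13, -1.18, -1.12, -0.82),
                PC2 = c(-1.63, -1.63, -1.81, -1.86, -1.86, -1.77))
d <- data.frame(PC1 = sample(u$PC1, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100,
                PC2 = sample(u$PC2, 100, replace = TRUE) * sample(90:100, 100, replace = TRUE) / 100)
b <- 0.1

m_win <- window_match(u, d, b)

# Brute-force reference, same test as the question's for loop
m_ref <- lapply(seq_len(nrow(u)), function(i) {
  which(abs(d$PC1 - u$PC1[i]) < b & abs(d$PC2 - u$PC2[i]) < b)
})
identical(m_win, m_ref)
```

How much this helps depends on the data: the window only prunes on PC1, so if most of d lies within b of a given PC1 value, the exact test still touches most rows.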

Answer 2

Alas, this seems to be only slightly faster than the for loop

unlist(Map(function(x,y) {
    which(abs(d$PC1-x)<b & abs(d$PC2-y)<b)
}, u$PC1, u$PC2))

But at least it's something.
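One thing to watch in the snippet above: unlist() flattens the result into a single vector, so it is no longer possible to tell which row of u each match belongs to. A minimal sketch, assuming the per-row grouping of the question's for loop is wanted, is to keep the list that Map() returns. The small u, d and b below are made-up illustration values, not the asker's data.

```r
u <- data.frame(PC1 = c(-1.10, -0.82), PC2 = c(-1.63, -1.77))
d <- data.frame(PC1 = c(-1.05, -0.80, -1.12, -0.50),
                PC2 = c(-1.60, -1.75, -1.70, -1.77))
b <- 0.1

# Map() returns one list element per row of u; leaving out unlist()
# keeps the matches grouped by the u row they belong to.
m <- Map(function(x, y) {
  which(abs(d$PC1 - x) < b & abs(d$PC2 - y) < b)
}, u$PC1, u$PC2)

lengths(m)   # number of matching d rows for each row of u
```

Here the first row of u matches rows 1 and 3 of d, and the second row of u matches row 2.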

Answer 3

I have a cunning plan :-). How about just doing the calculations:

> set.seed(10)
> bar<-matrix(runif(10),nc=2)
> bar
           [,1]      [,2]
[1,] 0.50747820 0.2254366
[2,] 0.30676851 0.2745305
[3,] 0.42690767 0.2723051
[4,] 0.69310208 0.6158293
[5,] 0.08513597 0.4296715
> foo<-c(.3,.7)
> thresh<-foo-bar
> sign(thresh)
     [,1] [,2]
[1,]   -1    1
[2,]    1    1
[3,]   -1    1
[4,]    1   -1
[5,]    1    1

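Now all you have to do is select the rows of that last matrix that equal c(-1,1), using which, and you can easily extract the desired rows from the bar matrix. Repeat for each row of foo.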
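The selection step described above could look like the following sketch. It hard-codes the bar values printed in the transcript instead of calling runif(), so it does not depend on the random number generator; the names s and rows are hypothetical.

```r
# bar values copied from the transcript above
bar <- matrix(c(0.50747820, 0.30676851, 0.42690767, 0.69310208, 0.08513597,
                0.2254366,  0.2745305,  0.2723051,  0.6158293,  0.4296715),
              ncol = 2)
foo <- c(0.3, 0.7)

s    <- sign(foo - bar)                      # same sign matrix as in the transcript
rows <- which(s[, 1] == -1 & s[, 2] == 1)    # rows whose sign pattern is c(-1, 1)
bar[rows, , drop = FALSE]                    # the matching rows of bar
```

With these values, rows 1 and 3 are selected. Note that foo - bar recycles foo down the columns of bar, because R stores matrices column by column; that is what the transcript does. To compare the whole first column against foo[1] and the whole second column against foo[2], something like sweep(bar, 2, foo) would be needed, which computes bar minus foo column by column.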
