
How to speed up fuzzy index matching in R (potentially using Rcpp)?

I am trying to look up a value in a given lookup table (lk_tbl) using what I will call, for lack of a better word, fuzzy matching. The table looks like this:

lk_tbl <- structure(list(num = c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 
1.4), val = c(0.241970724519143, 0.229882140684233, 0.217852177032551, 
0.205936268719975, 0.194186054983213, 0.182649085389022, 0.171368592047807, 
0.16038332734192, 0.149727465635745)), .Names = c("num", "val"
), row.names = c(NA, -9L), class = "data.frame")

> lk_tbl
   num       val
1 1.00 0.2419707
2 1.05 0.2298821
3 1.10 0.2178522
4 1.15 0.2059363
5 1.20 0.1941861
6 1.25 0.1826491
7 1.30 0.1713686
8 1.35 0.1603833
9 1.40 0.1497275

Basically, the table pairs a number with its associated value. Now, if I want to find the value associated with the number 1.22, which is not in lk_tbl, I want to do a sort of interpolation.

library(dplyr)  # provides %>% and filter()

fuzzy_lkup <- function(x) {
  matched_num <- lk_tbl %>% 
    filter(num == x)     # check for an exact match

  if (nrow(matched_num) == 1) { # if the exact match exists
    return(matched_num$val)
  } else {
    # otherwise average the values whose num lies within 0.05 of x
    return(lk_tbl %>% 
      filter(x < num + 0.05, x > num - 0.05) %>%
      .[["val"]] %>%
      mean())
  }
}


> fuzzy_lkup(1) # it returns the matched value
[1] 0.2419707
> fuzzy_lkup(1.22) # it does the interpolation
[1] 0.1884176

# for the vector, I can use vapply like this.
> vapply(c(1.22, 1.18, 1.24), fuzzy_lkup,numeric(1))
[1] 0.1884176 0.2000612 0.1884176

Ultimately, I want to do this for a long vector of inputs against a huge lookup table.

What I have observed is that this process is prohibitively slow for a large lookup table. So my questions are:

  1. How would you speed this up? (Vectorize this function?)
  2. How would you solve this with Rcpp? Is Rcpp the right tool for this? How would you import the lookup table, what data structure would you use, and ultimately how would you solve this problem?

From your description, a potential approach could be:

ff = function(x, num, val)
{
    i = findInterval(x, num)  #map input to the lookup-table

    #make the appropriate vectors to interpolate
    nums = c(rbind(num[i], x, num[i + 1L]))
    vals = c(rbind(val[i], NA, val[i + 1L]))

    #if 'mean' is needed; i.e. 'f(1.22) == f(1.24)' etc, the following could be used:
    #nums = seq_along(vals) 

    ans = approx(nums, vals, xout = nums)$y[seq(2L, length(nums), 3L)]

    return(cbind(x, ans))
}

And an example:

ff(c(1.22, 1.18, 1.24, 1.05, 1.2, 1.22, 1.23, 1.24, 1.4, 1.5), lk_tbl$num, lk_tbl$val)
#         x       ans
# [1,] 1.22 0.1895713
# [2,] 1.18 0.1988861
# [3,] 1.24 0.1849565
# [4,] 1.05 0.2298821
# [5,] 1.20 0.1941861
# [6,] 1.22 0.1895713
# [7,] 1.23 0.1872639
# [8,] 1.24 0.1849565
# [9,] 1.40 0.1497275
#[10,] 1.50        NA
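
Since the result above is plain linear interpolation between neighbouring table rows, the same numbers can probably be obtained by calling approx directly on the table. The following is a minimal sketch rather than part of the original answer; the vectors num, val and x are illustrative copies of the (rounded) table values shown in the question.

```r
# Illustrative copy of the lookup table (values rounded as printed above)
num <- c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4)
val <- c(0.2419707, 0.2298821, 0.2178522, 0.2059363, 0.1941861,
         0.1826491, 0.1713686, 0.1603833, 0.1497275)

x <- c(1.22, 1.05, 1.4, 1.5)

# rule = 1 (the default) returns NA for points outside the range of num
interp <- approx(num, val, xout = x, rule = 1)$y
print(cbind(x, interp))
```

With rule = 1 an input such as 1.5, which lies beyond the last table entry, comes back as NA, matching the behaviour of ff in the example.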

To address the second question, the above can also be conveniently translated to C thanks to R's C API:

ffC = inline::cfunction(sig = c(x = "numeric", num = "numeric", val = "numeric"), body = '
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));

    double *px = REAL(x), *pnum = REAL(num), *pval = REAL(val), *pans = REAL(ans);

    int n = LENGTH(num), flag;

    for(int i = 0, ind = 1; i < LENGTH(x); i++) {
        ind = findInterval(pnum, n, px[i], 0, 0, ind, &flag);

        // ind == 0: x is below the table; ind == n: x is at or beyond the last entry
        pans[i] = ind == 0 ? NA_REAL :
             ind == n ? (px[i] == pnum[n - 1] ? pval[n - 1] : NA_REAL) : 
             pval[ind - 1] + (pval[ind] - pval[ind - 1]) * 
                 ((px[i] - pnum[ind - 1]) / (pnum[ind] - pnum[ind - 1]));
    }

    UNPROTECT(1);
    return(ans);
', language = "C")
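
Since the question asks about Rcpp specifically, here is a rough sketch of how the same interpolation might look with Rcpp::cppFunction. It is not part of the original answer and is not included in the benchmark below. The name interp_cpp is hypothetical, the table is assumed to be sorted by num, and a binary search via std::upper_bound is swapped in for R's findInterval.

```r
# Sketch only: needs the Rcpp package and a C++ compiler.
# `interp_cpp` is a hypothetical name; `num` must be sorted in increasing order.
Rcpp::cppFunction('
NumericVector interp_cpp(NumericVector x, NumericVector num, NumericVector val) {
    int n = num.size();
    NumericVector out(x.size(), NA_REAL);          // NA unless set below
    if (n == 0) return out;
    for (R_xlen_t k = 0; k < x.size(); ++k) {
        double xk = x[k];
        if (ISNAN(xk) || xk < num[0] || xk > num[n - 1]) continue;
        // position of the first table entry strictly greater than xk
        int j = std::upper_bound(num.begin(), num.end(), xk) - num.begin();
        if (j == n) { out[k] = val[n - 1]; continue; }  // xk is the last entry
        double w = (xk - num[j - 1]) / (num[j] - num[j - 1]);
        out[k] = val[j - 1] + w * (val[j] - val[j - 1]);
    }
    return out;
}')

# Illustrative subset of the lookup table (values rounded as printed above)
num <- c(1, 1.05, 1.1, 1.15, 1.2, 1.25)
val <- c(0.2419707, 0.2298821, 0.2178522, 0.2059363, 0.1941861, 0.1826491)
res <- interp_cpp(c(1.22, 1.25, 1.3), num, val)
print(res)
```

Passing the two table columns as plain numeric vectors avoids any conversion of the data frame on the C++ side, which is one simple answer to the "how would you import the lookup table" part of the question.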

And benchmarking the two approaches:

NUM = seq(1, 100, 0.2)
set.seed(007)
VAL = runif(length(NUM))
X = sample(1:110, 1e5, TRUE) + sample(seq(0, 1, 0.01), 1e5, TRUE)

all.equal(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL))
#[1] TRUE
microbenchmark::microbenchmark(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL), times = 30)
#Unit: milliseconds
#                  expr        min         lq       mean     median         uq       max neval cld
# ff(X, NUM, VAL)[, 2L] 182.215633 222.755943 236.844409 225.315683 236.060114 366.74375    30   b
#      ffC(X, NUM, VAL)   6.927356   6.986864   7.375294   7.078041   7.198103  10.10846    30  a 
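
If the averaging behaviour of the original fuzzy_lkup is what is actually wanted (so that 1.22 and 1.24 give the same result), a vectorised base R version could look roughly like the sketch below. The helper name fuzzy_mean is hypothetical, the table is assumed to be sorted by num, and, unlike fuzzy_lkup, inputs outside the table range simply yield NA.

```r
# Hypothetical helper (not from the original post): exact match returns the
# table value, otherwise the mean of the two neighbouring values is returned.
fuzzy_mean <- function(x, num, val) {
  i <- findInterval(x, num)            # index of the last num <= x (0 if none)
  lo <- ifelse(i >= 1L, val[pmax(i, 1L)], NA_real_)
  hi <- val[i + 1L]                    # NA when x is at or beyond the last num
  exact <- i >= 1L & num[pmax(i, 1L)] == x
  ifelse(exact, lo, (lo + hi) / 2)
}

# Illustrative subset of the lookup table (values rounded as printed above)
num <- c(1, 1.05, 1.1, 1.15, 1.2, 1.25)
val <- c(0.2419707, 0.2298821, 0.2178522, 0.2059363, 0.1941861, 0.1826491)
out <- fuzzy_mean(c(1.22, 1.18, 1.24, 1), num, val)
print(out)
```

Because findInterval does a binary search per input, this avoids scanning the whole table for every element, which is where the vapply version spends most of its time.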
