How to speed up fuzzy index matching in R (potentially using Rcpp)?
I am trying to look up a value from a given lookup table (lk_tbl) using, for lack of a better word, fuzzy matching, such as the following:
lk_tbl <- structure(list(num = c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35,
1.4), val = c(0.241970724519143, 0.229882140684233, 0.217852177032551,
0.205936268719975, 0.194186054983213, 0.182649085389022, 0.171368592047807,
0.16038332734192, 0.149727465635745)), .Names = c("num", "val"
), row.names = c(NA, -9L), class = "data.frame")
> lk_tbl
num val
1 1.00 0.2419707
2 1.05 0.2298821
3 1.10 0.2178522
4 1.15 0.2059363
5 1.20 0.1941861
6 1.25 0.1826491
7 1.30 0.1713686
8 1.35 0.1603833
9 1.40 0.1497275
Basically, the table pairs a number with its associated value. Now, if I want to find the value associated with the number 1.22, which is not in lk_tbl, I want to do a sort of interpolation:
library(dplyr) # provides %>% and filter()

fuzzy_lkup <- function(x) {
  matched_num <- lk_tbl %>%
    filter(num == x) # check for an exact match
  if (nrow(matched_num) == 1) { # if the exact match exists
    return(matched_num$val)
  } else {
    # otherwise average the values whose num lies within 0.05 of x
    return(lk_tbl %>%
      filter(x < num + 0.05, x > num - 0.05) %>%
      .[["val"]] %>%
      mean())
  }
}
> fuzzy_lkup(1) # it returns the matched value
[1] 0.2419707
> fuzzy_lkup(1.22) # it does the interpolation
[1] 0.1884176
# for the vector, I can use vapply like this.
> vapply(c(1.22, 1.18, 1.24), fuzzy_lkup,numeric(1))
[1] 0.1884176 0.2000612 0.1884176
Ultimately, I want to do this for a long vector against a huge lookup table.
What I observed is that this process is prohibitively slow for a large lookup table, so my question is how to speed it up, potentially using Rcpp.
From your description, a potential approach could be:
ff = function(x, num, val)
{
    i = findInterval(x, num) #map input to the lookup-table
    #make the appropriate vectors to interpolate
    nums = c(rbind(num[i], x, num[i + 1L]))
    vals = c(rbind(val[i], NA, val[i + 1L]))
    #if 'mean' is needed; i.e. 'f(1.22) == f(1.24)' etc, the following could be used:
    #nums = seq_along(vals)
    ans = approx(nums, vals, xout = nums)$y[seq(2L, length(nums), 3L)]
    return(cbind(x, ans))
}
And an example:
ff(c(1.22, 1.18, 1.24, 1.05, 1.2, 1.22, 1.23, 1.24, 1.4, 1.5), lk_tbl$num, lk_tbl$val)
# x ans
# [1,] 1.22 0.1895713
# [2,] 1.18 0.1988861
# [3,] 1.24 0.1849565
# [4,] 1.05 0.2298821
# [5,] 1.20 0.1941861
# [6,] 1.22 0.1895713
# [7,] 1.23 0.1872639
# [8,] 1.24 0.1849565
# [9,] 1.40 0.1497275
#[10,] 1.50 NA
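As a cross-check on the example above, plain linear interpolation on the table can probably be obtained by calling approx directly on the lookup columns. This is only a sketch: the variable names x and direct are illustrative, and it assumes num is sorted and that NA is acceptable for inputs outside the table's range (the default rule = 1).

```r
# Lookup table values copied from the question.
num <- c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4)
val <- c(0.241970724519143, 0.229882140684233, 0.217852177032551,
         0.205936268719975, 0.194186054983213, 0.182649085389022,
         0.171368592047807, 0.16038332734192, 0.149727465635745)

# Illustrative inputs: three interior points, one exact breakpoint, one out of range.
x <- c(1.22, 1.18, 1.24, 1.05, 1.5)

# approx() interpolates linearly between neighbouring (num, val) pairs;
# with the default rule = 1, inputs outside range(num) yield NA.
direct <- approx(num, val, xout = x)$y
cbind(x, direct)
```

For inputs inside the table's range this should agree with the ans column shown above, without building the interleaved nums and vals vectors.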
To address the second question, the above can also be conveniently ported to C thanks to R's API:
ffC = inline::cfunction(sig = c(x = "numeric", num = "numeric", val = "numeric"), body = '
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    double *px = REAL(x), *pnum = REAL(num), *pval = REAL(val), *pans = REAL(ans);
    int n = LENGTH(num), flag;
    for(int i = 0, ind = 1; i < LENGTH(x); i++) {
        /* 1-based interval index of px[i]; the previous result seeds the next search */
        ind = findInterval(pnum, n, px[i], 0, 0, ind, &flag);
        /* below the table: NA (avoids reading pval[-1]);
           at or past the last breakpoint: exact match or NA;
           otherwise: linear interpolation between the two neighbours */
        pans[i] = ind == 0 ? NA_REAL :
                  ind == n ? (px[i] == pnum[n - 1] ? pval[n - 1] : NA_REAL) :
                  pval[ind - 1] + (pval[ind] - pval[ind - 1]) *
                  ((px[i] - pnum[ind - 1]) / (pnum[ind] - pnum[ind - 1]));
    }
    UNPROTECT(1);
    return(ans);
', language = "C")
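Since the question mentions Rcpp specifically, a rough Rcpp counterpart of the C routine might look like the sketch below. It is not part of the benchmark that follows and has not been timed here. The function name ffRcpp is hypothetical, and the sketch assumes the Rcpp package and a working C++ compiler are available, that num is sorted in increasing order without duplicates, and that NA is the desired result for missing or out-of-range inputs.

```r
library(Rcpp)

# Hypothetical helper: linear interpolation of val over sorted breakpoints num.
cppFunction('
NumericVector ffRcpp(NumericVector x, NumericVector num, NumericVector val) {
    int n = num.size(), m = x.size();
    NumericVector ans(m, NA_REAL);          // default every result to NA
    if (n == 0) return ans;
    for (int i = 0; i < m; i++) {
        // leave NA for missing inputs and for inputs outside [num[0], num[n - 1]]
        if (ISNAN(x[i]) || x[i] < num[0] || x[i] > num[n - 1]) continue;
        // position of the first breakpoint strictly greater than x[i]
        int hi = std::upper_bound(num.begin(), num.end(), x[i]) - num.begin();
        if (hi == n) { ans[i] = val[n - 1]; continue; }  // x[i] equals the last breakpoint
        int lo = hi - 1;
        ans[i] = val[lo] + (val[hi] - val[lo]) * (x[i] - num[lo]) / (num[hi] - num[lo]);
    }
    return ans;
}', includes = "#include <algorithm>")

# Usage on the lookup table from the question.
num <- c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4)
val <- c(0.241970724519143, 0.229882140684233, 0.217852177032551,
         0.205936268719975, 0.194186054983213, 0.182649085389022,
         0.171368592047807, 0.16038332734192, 0.149727465635745)
res <- ffRcpp(c(1.22, 1.4, 1.5, 0.9), num, val)
```

The binary search restarts for every element, whereas the C version reuses the previous interval as a starting hint, so the two may perform differently on sorted input.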
And benchmarking the two approaches:
NUM = seq(1, 100, 0.2)
set.seed(007)
VAL = runif(length(NUM))
X = sample(1:110, 1e5, TRUE) + sample(seq(0, 1, 0.01), 1e5, TRUE)
all.equal(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL))
#[1] TRUE
microbenchmark::microbenchmark(ff(X, NUM, VAL)[, 2L], ffC(X, NUM, VAL), times = 30)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# ff(X, NUM, VAL)[, 2L] 182.215633 222.755943 236.844409 225.315683 236.060114 366.74375 30 b
# ffC(X, NUM, VAL) 6.927356 6.986864 7.375294 7.078041 7.198103 10.10846 30 a
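If the averaging behaviour of the original fuzzy_lkup is wanted rather than linear interpolation, a vectorized variant could be sketched along the same findInterval idea. The name fuzzy_mean is hypothetical, and the sketch assumes that num is sorted and that the two rows to average are always the adjacent breakpoints around x, which holds for the evenly spaced table in the question but is not identical to the ±0.05 window in general.

```r
# Hypothetical vectorized take on the question's fuzzy_lkup:
# exact matches return their own value, everything else the mean of the two neighbours.
fuzzy_mean <- function(x, num, val) {
  i <- findInterval(x, num)          # index of the breakpoint at or below x
  i[i == 0L] <- NA_integer_          # x below the table: no lower neighbour
  ans <- (val[i] + val[i + 1L]) / 2  # NA when x lies beyond the last breakpoint
  exact <- which(num[i] == x)        # which() drops the NA comparisons
  ans[exact] <- val[i[exact]]
  ans
}

# Lookup table values copied from the question.
num <- c(1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4)
val <- c(0.241970724519143, 0.229882140684233, 0.217852177032551,
         0.205936268719975, 0.194186054983213, 0.182649085389022,
         0.171368592047807, 0.16038332734192, 0.149727465635745)
out <- fuzzy_mean(c(1.22, 1.18, 1.24, 1), num, val)
```

On these four inputs the result should line up with the fuzzy_lkup and vapply outputs shown in the question.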