[英]fast R lookup table
Similar questions were asked before, but without clear generic answers.之前也有人问过类似的问题,但没有明确的通用答案。 (And Joseph Adler's experiments are no longer on the web, and his book just says "write an S4 class.") (Joseph Adler 的实验不再在网上,他的书只说“编写一个 S4 类”。)
Assume a large lookup table with multiple indexes.假设有一个包含多个索引的大型查找表。 Assume a modest size set of values to look up.假设要查找的值大小适中。 Even an R merge is very slow.即使是 R 合并也很慢。 Here is an example:下面是一个例子:
{
L <- 100000000 ## only 100M entries for 1GB*4 of int data
lookuptable <- data.frame( i1=sample(1:L), i2=sample(1:L), v1=rnorm(L), v2=rnorm(L) )
NLUP <- 10 ## look up only 10+1 values in large table
vali <- sample(1:L, NLUP)
lookmeup <- data.frame( i1= c(lookuptable[vali,1], -1),
i2= c(lookuptable[vali,2],-1), vA=rnorm(11) )
rm(vali); rm(L)
}
## I want to speed this up---how?
system.time( merge( lookmeup, lookuptable, by.x=c("i1","i3"), by.y=c("i1","i2"),
all.x=T, all.y=F, sort=F ) )
(Try it! 500 second on my 2019 iMac). (试试吧!在我的 2019 iMac 上 500 秒)。 So what is the recommended way of doing this?那么推荐的方法是什么?
I could write code that creates unique integer fingerprints from the columns first (for fast comparisons), and then I just match on one column.我可以编写代码,首先从列中创建唯一的整数指纹(为了快速比较),然后我只匹配一列。 But this is not easy either, 'cause I need to avoid accidental duplicate fingerprints, or add more logic for conflicts.但这也不容易,因为我需要避免意外的重复指纹,或者为冲突添加更多逻辑。
Given integer fingerprints, I could then use either data.table
with setkey
on the fingerprints (or can it encapsulate two-column indexes, too? I tried but failed, perhaps because I am not a regular user);鉴于整数指纹,我可以然后使用data.table
用setkey
上的指纹(或者它可以封装两列索引,我也试过,但失败了,也许是因为我不是一个普通用户?); or I could write a C program that takes two integer fingerprint columns and returns one.或者我可以编写一个 C 程序,它接受两个整数指纹列并返回一个。
To match two data.frames
on multiple columns you can use from base merge or match in combination with interaction , paste or use a list .要匹配多个列上的两个data.frames
,您可以使用基本合并或匹配与交互、 粘贴或使用列表相结合。 It is also possible to map two integers to one, in a unique and deterministic way .也可以以独特且确定的方式将两个整数映射到一个。 A simple extension is the fastmatch
library which can be faster than match
from base .一个简单的扩展是fastmatch
库,它可以比来自base 的match
更快。 Also dplyr
or data.table
can be an option. dplyr
或data.table
也可以是一个选项。 Have also a look at: Matching more than 2 conditions , How to join (merge) data frames and Fast single item lookup .也看看:匹配超过 2 个条件,如何加入(合并)数据帧和快速单项查找。
library(fastmatch)
library(dplyr)
library(microbenchmark)
microbenchmark(times = 10L, setup = gc(), check = "equivalent"
, merge = merge(lookMeUp, lookupTable, all.x=TRUE, sort=FALSE)
, dplyr = left_join(lookMeUp, lookupTable, by = c("i1", "i2"))
, inter = cbind(lookMeUp, lookupTable[match(interaction(lookMeUp[c("i1","i2")])
, interaction(lookupTable[c("i1","i2")])), 3:4])
, paste = cbind(lookMeUp, lookupTable[match(paste(lookMeUp$i1, lookMeUp$i2)
, paste(lookupTable$i1, lookupTable$i2)), 3:4])
, int = cbind(lookMeUp, lookupTable[match(lookMeUp$i1 + lookMeUp$i2 * max(lookupTable$i1)
, lookupTable$i1 + lookupTable$i2 * max(lookupTable$i1)), 3:4])
, fInter = cbind(lookMeUp, lookupTable[fmatch(interaction(lookMeUp[c("i1","i2")])
, interaction(lookupTable[c("i1","i2")])), 3:4])
, fPaste = cbind(lookMeUp, lookupTable[fmatch(paste(lookMeUp$i1, lookMeUp$i2)
, paste(lookupTable$i1, lookupTable$i2)), 3:4])
, fint = cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * max(lookMeUp$i1)
, lookupTable$i1 + lookupTable$i2 * max(lookMeUp$i1)), 3:4])
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# merge 2547.72575 2564.72138 2590.03400 2578.14307 2585.01870 2735.23435 10
# dplyr 690.55046 695.56161 703.01335 703.95085 707.32141 714.00890 10
# inter 511.86378 514.36418 528.73905 529.14331 535.33359 552.20183 10
# paste 750.01340 763.84494 942.47309 777.73232 1273.83380 1377.00192 10
# int 71.56913 72.15233 73.27748 72.92613 73.89630 77.01510 10
# fInter 447.82012 450.00472 459.51196 455.82473 464.85767 491.52366 10
# fPaste 713.68824 719.60794 796.94680 726.70971 788.36997 1316.64071 10
# fint 59.04541 59.13039 60.95638 60.59758 62.58539 63.65308 10
Instead of creating the unique identifier each time you make a look up, you can store it in the lookup table, what will make the lookup faster but you have an overhead in creating it.您可以将其存储在查找表中,而不是每次查找时都创建唯一标识符,这将使查找更快,但创建它会产生开销。 You can also sort the lookup table by this identifier which will allow accessing the data line without using match
but this method will add not defined rows in case there are some combinations missing, what will be equivalent in creation a matrix
or array
.您还可以通过此标识符对查找表进行排序,这将允许在不使用match
情况下访问数据行,但此方法将添加未定义的行,以防缺少某些组合,在创建matrix
或array
将等效。 You can also use the build in hash for looking up variables in an environment
.您还可以使用内置散列来查找environment
变量。 Also the binary search from findInterval
can be used.也可以使用来自findInterval
的二分搜索。
system.time({maxLTi1 <- max(lookupTable$i1); lookupTable$id <- lookupTable$i1 + lookupTable$i2 * maxLTi1})
# User System verstrichen
# 0.006 0.000 0.006
system.time(fmatch(c(lookupTable$id[1], 0), lookupTable$id)) #Create Hash
# User System verstrichen
# 0.056 0.000 0.056
#system.time(fmatch(lookupTable$id[1], lookupTable$id)) #Create Hash in case you have only matches
# User System verstrichen
# 0.016 0.004 0.020
system.time({
lookupTableS <- lookupTable[0,]
lookupTableS[lookupTable$id,] <- lookupTable #Sort Table with gaps
})
# User System verstrichen
# 0.080 0.011 0.091
system.time({
lookupTableS2 <- lookupTable[order(lookupTable$id),] #Sort Table
})
# User System verstrichen
# 0.074 0.000 0.074
library(Matrix)
system.time({ #Sorted Sparse Vector
i <- order(lookupTable$id)
lookupTableS3 <- sparseVector(i, lookupTable$id[i], max(lookupTable$id))})
# User System verstrichen
# 0.057 0.008 0.065
system.time(lupEnv <- list2env(setNames(as.list(seq_len(nrow(lookupTable))), paste(lookupTable$i1, lookupTable$i2))))
# User System verstrichen
# 4.824 0.056 4.880
library(data.table);
lookupTableDT <- as.data.table(copy(lookupTable))
lookMeUpDT <- as.data.table(copy(lookMeUp))
system.time(setkey(lookupTableDT, i1, i2))
# User System verstrichen
# 0.094 0.000 0.027
lookMeUpDT$id <- lookMeUp$i1 + lookMeUp$i2 * max(lookupTable$i1)
lookupTableDTId <- as.data.table(copy(lookupTable))
system.time(setkey(lookupTableDTId, id))
# User System verstrichen
# 0.091 0.000 0.026
lookMeUpDTId <- copy(lookMeUpDT)
lookMeUpDTId$row <- seq_len(nrow(lookMeUpDTId))
setkey(lookMeUpDTId, id)
microbenchmark(times = 10L, setup = gc(), check = "equivalent"
, int = cbind(lookMeUp, lookupTable[match(lookMeUp$i1 + lookMeUp$i2 * max(lookupTable$i1)
, lookupTable$i1 + lookupTable$i2 * max(lookupTable$i1)), 3:4])
, fint = cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * max(lookMeUp$i1)
, lookupTable$i1 + lookupTable$i2 * max(lookMeUp$i1)), 3:4])
, id = cbind(lookMeUp, lookupTable[match(lookMeUp$i1 + lookMeUp$i2 * maxLTi1
, lookupTable$id), 3:4])
, sparid = {i <- lookMeUp$i1 + lookMeUp$i2 * maxLTi1
j <- i
j[i>0] <- as.vector(lookupTableS3[i[i>0]])
cbind(lookMeUp, lookupTable[ifelse(j>0,j,NA), 3:4])}
, DT = merge(lookMeUpDT[,1:3], lookupTableDT[,1:4], by=c("i1", "i2"), all.x=TRUE, sort = FALSE)
, DTid = merge(lookMeUpDT, lookupTableDTId[,-2:-1], by=c("id"), all.x=TRUE, sort = FALSE)[,-1]
, DiIdKey = merge(lookMeUpDTId, lookupTableDTId[,-2:-1], all.x=TRUE, sort = FALSE)[order(row),][,c(-1,-5)]
, findInt = {i <- lookMeUp$i1 + lookMeUp$i2 * maxLTi1
j <- findInterval(i, lookupTableS2$id)
j[j==0] <- NA
j[i != lookupTableS2$id[j]] <- NA
cbind(lookMeUp, lookupTableS2[j, 3:4])}
, envir = cbind(lookMeUp, lookupTable[vapply(paste(lookMeUp$i1, lookMeUp$i2), function(i) {x <- lupEnv[[i]]; if(is.null(x)) NA else x}, 1), 3:4])
, fid = cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * maxLTi1
, lookupTable$id), 3:4])
, sid = cbind(lookMeUp, lookupTableS[ifelse(lookMeUp$i1 > 0, lookMeUp$i1 + lookMeUp$i2 * maxLTi1, NA), 3:4])
)
#Unit: microseconds
# expr min lq mean median uq max neval
# int 75167.977 76446.819 77817.3349 77958.9650 78649.235 80656.715 10
# fint 63332.436 63948.769 64574.8881 64194.2765 64942.559 66808.193 10
# id 68198.639 69293.551 70477.6062 70223.0505 71393.354 74951.007 10
# sparid 9181.928 9217.312 9552.0241 9478.8475 9561.917 10895.649 10
# DT 4990.075 5000.857 5125.6716 5051.4970 5157.057 5547.220 10
# DTid 4167.229 4189.703 4250.0804 4232.8955 4289.718 4440.924 10
# DiIdKey 4547.589 4582.915 4626.9514 4597.6790 4634.311 4867.630 10
# findInt 2795.560 2813.100 2854.7069 2815.4890 2857.084 3097.120 10
# envir 526.971 530.459 537.5767 532.9755 546.402 551.231 10
# fid 424.790 425.218 433.7295 433.3335 441.673 444.026 10
# sid 436.135 439.688 445.1770 441.5705 445.331 464.685 10
#In case order and columns need not be like the others
microbenchmark(times = 10L, setup = gc(), unit = "us",
DiIdKey = merge(lookMeUpDTId, lookupTableDTId, all.x=TRUE, sort = FALSE))
#Unit: microseconds
# expr min lq mean median uq max neval
# DiIdKey 1692.629 1706.14 1719.556 1717.142 1722.067 1778.88 10
Creating a unique identifier and store it in the lookup table and using fmatch
could be recommended .创建一个唯一的标识符,并将其存储在查找表和使用fmatch
可建议。 In pure base the lookup table could be sorted by the ID and missing combinations will be filled with NA what allows direct access to the matching rows without using match
.在纯基础中,查找表可以按 ID 排序,缺少的组合将用 NA 填充,这允许直接访问匹配的行而不使用match
。 Alternatively the lookup can be done in an environment where the build in hash search is used but this has much overhead.或者,可以在使用内置哈希搜索的环境中完成查找,但这会产生很多开销。 Also using findInterval
shows good results.还使用findInterval
显示出良好的结果。
In case the columns are not (positive) integer
cast them to factor
and use their integer values.如果列不是(正) integer
则将它们转换为factor
并使用它们的整数值。
Data:数据:
set.seed(7)
sqrtN <- 1e3
lookupTable <- data.frame(expand.grid(i1=seq_len(sqrtN), i2=seq_len(sqrtN)), v1=seq_len(sqrtN*sqrtN))[sample(sqrtN*sqrtN),]
lookupTable$v2 <- seq_len(sqrtN*sqrtN)
lookMeUp <- rbind(lookupTable[sample(nrow(lookupTable), 10), 1:2], c(-1, -1))
lookMeUp$vA <- letters[1:nrow(lookMeUp)]
Timings of lookuptable with 5e7 rows:具有 5e7 行的查找表的时间:
sqrtN <- 7.1e3
lookupTable <- data.frame(expand.grid(i1=seq_len(sqrtN), i2=seq_len(sqrtN)), v1=seq_len(sqrtN*sqrtN))[sample(sqrtN*sqrtN),]
lookupTable$v2 <- seq_len(sqrtN*sqrtN)
lookMeUp <- rbind(lookupTable[sample(nrow(lookupTable), 10), 1:2], c(-1, -1))
lookMeUp$vA <- letters[1:nrow(lookMeUp)]
system.time({maxLTi1 <- max(lookupTable$i1); lookupTable$id <- lookupTable$i1 + lookupTable$i2 * maxLTi1})
# User System verstrichen
# 0.312 0.016 0.329
system.time(lookupTable <- lookupTable[order(lookupTable$id),]) #For findIntervall
# User System verstrichen
# 6.786 0.120 6.905
system.time({
i <- lookMeUp$i1 + lookMeUp$i2 * maxLTi1
j <- findInterval(i, lookupTable$id)
j[j==0] <- NA
j[i != lookupTable$id[j]] <- NA
cbind(lookMeUp, lookupTable[j, 3:4])
})
# User System verstrichen
# 0.099 0.048 0.147
system.time(fmatch(c(lookupTable$id[1], 0), lookupTable$id)) #Create Hash
# User System verstrichen
# 2.642 0.120 2.762
system.time(cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * maxLTi1, lookupTable$id), 3:4]))
# User System verstrichen
# 0 0 0
I finally broke down and made this a more generic function:我终于崩溃了,使它成为一个更通用的函数:
set.seed(0); K <- 1000; M <- K*K
rint <- function( n, minv=0, maxv=NA ) sample( minv:(if (is.na(maxv)) n else maxv), n, repl=T )
dict.lookup <- function( dwords, dictionary, by=NULL, by.w=NULL, by.d=NULL ) {
# bad style, but just (mostly symmetric) error checking
if (is.null(by.d)) by.d <- by; if (is.null(by.w)) by.w <- by
stopifnot( (!is.null(by.d)) & (!is.null(by.w)))
# valid input checking
stopifnot( is.data.frame( dwords ) ); stopifnot( is.data.frame( dictionary ) )
stopifnot( nrow( dwords ) > 0 ); stopifnot( nrow( dictionary ) > 0 )
stopifnot( is.character(by.w) ); stopifnot( is.character(by.d) )
stopifnot( length(by.w)==1 ); stopifnot( length(by.d)==1)
stopifnot( by.w %in% names(dwords) ); stopifnot( by.d %in% names(dictionary) )
# you cannot give the words directly. hash them first
stopifnot( is.numeric( dwords[[by.w]] ) )
stopifnot( is.numeric( dictionary[[by.d]] ) )
# a dictionary should have only unique entries
stopifnot( anyDuplicated( dictionary[[by.d]] ) == 0 )
# the actual work
toright <- dictionary[ match(dwords[[by.w]], dictionary[[by.d]]), ]
cbind(dwords, toright[ , names(toright) != by.d ])
}
L <- 100*M ## only 100M entries for 1GB*4 of int data
dictionary <- data.frame( idictwords=sample(1:L), cost2print=rint(L, 1,100), tiresomeness=rint(L, 100,200) )
message("created dictionary")
## look up 10+1 words
dwords <- data.frame( imywords= c(dictionary[ sample(1:L, 10) , "idictwords"], -99), frombook=rint(11, 200,300) )
message("created my words")
print( system.time( o <- dict.lookup( dwords, dictionary, by.w= "imywords", by.d= "idictwords" ) ) )
message("looked up my words in dictionary done")
print(o)
gives me给我
user system elapsed
13.746 0.739 14.489
imywords frombook cost2print tiresomeness
68533657 88509161 263 25 110
87030862 75614422 297 23 164
16923080 79185053 249 84 105
62235248 84542527 292 72 141
4044547 35212219 201 13 155
95995528 67895828 257 4 122
43031831 24227004 281 86 101
76602707 55164521 270 52 151
53380001 87665273 207 35 121
24278223 30085231 238 6 153
NA -99 205 NA NA
the rownames are the matching rows from the dictionary data frame.行名是字典数据框中的匹配行。
I often tinker with functions (often to get better handling).我经常修改函数(通常是为了更好地处理)。 feel free to suggest changes.随时提出更改建议。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.