简体   繁体   English

快速 R 查找表

[英]fast R lookup table

Similar questions were asked before, but without clear generic answers.之前也有人问过类似的问题,但没有明确的通用答案。 (And Joseph Adler's experiments are no longer on the web, and his book just says "write an S4 class.") (Joseph Adler 的实验不再在网上,他的书只说“编写一个 S4 类”。)

Assume a large lookup table with multiple indexes.假设有一个包含多个索引的大型查找表。 Assume a modest size set of values to look up.假设要查找的值大小适中。 Even an R merge is very slow.即使是 R 合并也很慢。 Here is an example:下面是一个例子:

{
    L <- 100000000  ## only 100M entries for 1GB*4 of int data
    lookuptable  <- data.frame( i1=sample(1:L), i2=sample(1:L), v1=rnorm(L), v2=rnorm(L) )
    NLUP <- 10      ## look up only 10+1 values in large table
    vali <- sample(1:L, NLUP)
    lookmeup <- data.frame( i1= c(lookuptable[vali,1], -1),
                       i2= c(lookuptable[vali,2],-1), vA=rnorm(11) )
    rm(vali); rm(L)
}

## I want to speed this up---how?
system.time( merge( lookmeup, lookuptable,  by.x=c("i1","i3"), by.y=c("i1","i2"),
                   all.x=T, all.y=F, sort=F ) )

(Try it! 500 second on my 2019 iMac). (试试吧!在我的 2019 iMac 上 500 秒)。 So what is the recommended way of doing this?那么推荐的方法是什么?

I could write code that creates unique integer fingerprints from the columns first (for fast comparisons), and then I just match on one column.我可以编写代码,首先从列中创建唯一的整数指纹(为了快速比较),然后我只匹配一列。 But this is not easy either, 'cause I need to avoid accidental duplicate fingerprints, or add more logic for conflicts.但这也不容易,因为我需要避免意外的重复指纹,或者为冲突添加更多逻辑。

Given integer fingerprints, I could then use either data.table with setkey on the fingerprints (or can it encapsulate two-column indexes, too? I tried but failed, perhaps because I am not a regular user);鉴于整数指纹,我可以然后使用data.tablesetkey上的指纹(或者它可以封装两列索引,我也试过,但失败了,也许是因为我不是一个普通用户?); or I could write a C program that takes two integer fingerprint columns and returns one.或者我可以编写一个 C 程序,它接受两个整数指纹列并返回一个。

To match two data.frames on multiple columns you can use from base merge or match in combination with interaction , paste or use a list .匹配多个列上的两个data.frames您可以使用基本合并匹配交互粘贴或使用列表相结合。 It is also possible to map two integers to one, in a unique and deterministic way .也可以以独特且确定的方式将两个整数映射到一个 A simple extension is the fastmatch library which can be faster than match from base .一个简单的扩展是fastmatch库,它可以比来自base 的match更快。 Also dplyr or data.table can be an option. dplyrdata.table也可以是一个选项。 Have also a look at: Matching more than 2 conditions , How to join (merge) data frames and Fast single item lookup .也看看:匹配超过 2 个条件如何加入(合并)数据帧快速单项查找

library(fastmatch)
library(dplyr)
library(microbenchmark)
microbenchmark(times = 10L, setup = gc(), check = "equivalent"
 , merge = merge(lookMeUp, lookupTable, all.x=TRUE, sort=FALSE)
 , dplyr = left_join(lookMeUp, lookupTable, by = c("i1", "i2"))
 , inter = cbind(lookMeUp, lookupTable[match(interaction(lookMeUp[c("i1","i2")])
                 , interaction(lookupTable[c("i1","i2")])), 3:4])
 , paste = cbind(lookMeUp, lookupTable[match(paste(lookMeUp$i1, lookMeUp$i2)
                 , paste(lookupTable$i1, lookupTable$i2)), 3:4])
 , int = cbind(lookMeUp, lookupTable[match(lookMeUp$i1 + lookMeUp$i2 * max(lookupTable$i1)
                 , lookupTable$i1 + lookupTable$i2 * max(lookupTable$i1)), 3:4])
 , fInter = cbind(lookMeUp, lookupTable[fmatch(interaction(lookMeUp[c("i1","i2")])
                 , interaction(lookupTable[c("i1","i2")])), 3:4])
 , fPaste = cbind(lookMeUp, lookupTable[fmatch(paste(lookMeUp$i1, lookMeUp$i2)
                  , paste(lookupTable$i1, lookupTable$i2)), 3:4])
 , fint = cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * max(lookMeUp$i1)
                 , lookupTable$i1 + lookupTable$i2 * max(lookMeUp$i1)), 3:4])
)
#Unit: milliseconds
#   expr        min         lq       mean     median         uq        max neval
#  merge 2547.72575 2564.72138 2590.03400 2578.14307 2585.01870 2735.23435    10
#  dplyr  690.55046  695.56161  703.01335  703.95085  707.32141  714.00890    10
#  inter  511.86378  514.36418  528.73905  529.14331  535.33359  552.20183    10
#  paste  750.01340  763.84494  942.47309  777.73232 1273.83380 1377.00192    10
#    int   71.56913   72.15233   73.27748   72.92613   73.89630   77.01510    10
# fInter  447.82012  450.00472  459.51196  455.82473  464.85767  491.52366    10
# fPaste  713.68824  719.60794  796.94680  726.70971  788.36997 1316.64071    10
#   fint   59.04541   59.13039   60.95638   60.59758   62.58539   63.65308    10

Instead of creating the unique identifier each time you make a look up, you can store it in the lookup table, what will make the lookup faster but you have an overhead in creating it.您可以将其存储在查找表中,而不是每次查找时都创建唯一标识符,这将使查找更快,但创建它会产生开销。 You can also sort the lookup table by this identifier which will allow accessing the data line without using match but this method will add not defined rows in case there are some combinations missing, what will be equivalent in creation a matrix or array .您还可以通过此标识符对查找表进行排序,这将允许在不使用match情况下访问数据行,但此方法将添加未定义的行,以防缺少某些组合,在创建matrixarray将等效。 You can also use the build in hash for looking up variables in an environment .您还可以使用内置散列来查找environment变量。 Also the binary search from findInterval can be used.也可以使用来自findInterval二分搜索

system.time({maxLTi1 <- max(lookupTable$i1); lookupTable$id <- lookupTable$i1 + lookupTable$i2 * maxLTi1})
#       User      System verstrichen 
#      0.006       0.000       0.006 

system.time(fmatch(c(lookupTable$id[1], 0), lookupTable$id))  #Create Hash
#       User      System verstrichen 
#      0.056       0.000       0.056 
#system.time(fmatch(lookupTable$id[1], lookupTable$id))  #Create Hash in case you have only matches
#       User      System verstrichen 
#      0.016       0.004       0.020 

system.time({
lookupTableS <- lookupTable[0,]
lookupTableS[lookupTable$id,] <- lookupTable #Sort Table with gaps
})
#       User      System verstrichen 
#      0.080       0.011       0.091 

system.time({
lookupTableS2 <- lookupTable[order(lookupTable$id),] #Sort Table
})
#       User      System verstrichen 
#      0.074       0.000       0.074 

library(Matrix)
system.time({ #Sorted Sparse Vector
  i <- order(lookupTable$id)
  lookupTableS3 <- sparseVector(i, lookupTable$id[i], max(lookupTable$id))})
#       User      System verstrichen 
#      0.057       0.008       0.065 

system.time(lupEnv <- list2env(setNames(as.list(seq_len(nrow(lookupTable))), paste(lookupTable$i1, lookupTable$i2))))
#       User      System verstrichen 
#      4.824       0.056       4.880 

library(data.table);
lookupTableDT <- as.data.table(copy(lookupTable))
lookMeUpDT <- as.data.table(copy(lookMeUp))
system.time(setkey(lookupTableDT, i1, i2))
#       User      System verstrichen 
#      0.094       0.000       0.027 

lookMeUpDT$id <- lookMeUp$i1 + lookMeUp$i2 * max(lookupTable$i1)
lookupTableDTId <- as.data.table(copy(lookupTable))
system.time(setkey(lookupTableDTId, id))
#       User      System verstrichen 
#      0.091       0.000       0.026 

lookMeUpDTId <- copy(lookMeUpDT)
lookMeUpDTId$row <- seq_len(nrow(lookMeUpDTId))
setkey(lookMeUpDTId, id)

microbenchmark(times = 10L, setup = gc(), check = "equivalent"
 , int = cbind(lookMeUp, lookupTable[match(lookMeUp$i1 + lookMeUp$i2 * max(lookupTable$i1)
                 , lookupTable$i1 + lookupTable$i2 * max(lookupTable$i1)), 3:4])
 , fint = cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * max(lookMeUp$i1)
                 , lookupTable$i1 + lookupTable$i2 * max(lookMeUp$i1)), 3:4])
 , id = cbind(lookMeUp, lookupTable[match(lookMeUp$i1 + lookMeUp$i2 * maxLTi1
                 , lookupTable$id), 3:4])
 , sparid = {i <- lookMeUp$i1 + lookMeUp$i2 * maxLTi1
   j <- i
   j[i>0] <- as.vector(lookupTableS3[i[i>0]])
   cbind(lookMeUp, lookupTable[ifelse(j>0,j,NA), 3:4])}
 , DT = merge(lookMeUpDT[,1:3], lookupTableDT[,1:4], by=c("i1", "i2"), all.x=TRUE, sort = FALSE)
 , DTid = merge(lookMeUpDT, lookupTableDTId[,-2:-1], by=c("id"), all.x=TRUE, sort = FALSE)[,-1]
 , DiIdKey = merge(lookMeUpDTId, lookupTableDTId[,-2:-1], all.x=TRUE, sort = FALSE)[order(row),][,c(-1,-5)]
 , findInt = {i <- lookMeUp$i1 + lookMeUp$i2 * maxLTi1
    j  <- findInterval(i, lookupTableS2$id)
    j[j==0]  <- NA
    j[i != lookupTableS2$id[j]] <- NA
    cbind(lookMeUp, lookupTableS2[j, 3:4])}
 , envir = cbind(lookMeUp, lookupTable[vapply(paste(lookMeUp$i1, lookMeUp$i2), function(i) {x  <- lupEnv[[i]]; if(is.null(x)) NA else x}, 1), 3:4])
 , fid = cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * maxLTi1
                 , lookupTable$id), 3:4])
 , sid = cbind(lookMeUp, lookupTableS[ifelse(lookMeUp$i1 > 0, lookMeUp$i1 + lookMeUp$i2 * maxLTi1, NA), 3:4])
)
#Unit: microseconds
#    expr       min        lq       mean     median        uq       max neval
#     int 75167.977 76446.819 77817.3349 77958.9650 78649.235 80656.715    10
#    fint 63332.436 63948.769 64574.8881 64194.2765 64942.559 66808.193    10
#      id 68198.639 69293.551 70477.6062 70223.0505 71393.354 74951.007    10
#  sparid  9181.928  9217.312  9552.0241  9478.8475  9561.917 10895.649    10
#      DT  4990.075  5000.857  5125.6716  5051.4970  5157.057  5547.220    10
#    DTid  4167.229  4189.703  4250.0804  4232.8955  4289.718  4440.924    10
# DiIdKey  4547.589  4582.915  4626.9514  4597.6790  4634.311  4867.630    10
# findInt  2795.560  2813.100  2854.7069  2815.4890  2857.084  3097.120    10
#   envir   526.971   530.459   537.5767   532.9755   546.402   551.231    10
#     fid   424.790   425.218   433.7295   433.3335   441.673   444.026    10
#     sid   436.135   439.688   445.1770   441.5705   445.331   464.685    10

#In case order and columns need not be like the others
microbenchmark(times = 10L, setup = gc(), unit = "us",
 DiIdKey = merge(lookMeUpDTId, lookupTableDTId, all.x=TRUE, sort = FALSE))
#Unit: microseconds
#    expr      min      lq     mean   median       uq     max neval
# DiIdKey 1692.629 1706.14 1719.556 1717.142 1722.067 1778.88    10

Creating a unique identifier and store it in the lookup table and using fmatch could be recommended .创建一个唯一的标识符,并将其存储在查找表和使用fmatch建议 In pure base the lookup table could be sorted by the ID and missing combinations will be filled with NA what allows direct access to the matching rows without using match .在纯基础中,查找表可以按 ID 排序,缺少的组合将用 NA 填充,这允许直接访问匹配的行而不使用match Alternatively the lookup can be done in an environment where the build in hash search is used but this has much overhead.或者,可以在使用内置哈希搜索的环境中完成查找,但这会产生很多开销。 Also using findInterval shows good results.还使用findInterval显示出良好的结果。

In case the columns are not (positive) integer cast them to factor and use their integer values.如果列不是(正) integer则将它们转换为factor并使用它们的整数值。

Data:数据:

set.seed(7)
sqrtN  <- 1e3
lookupTable <- data.frame(expand.grid(i1=seq_len(sqrtN), i2=seq_len(sqrtN)), v1=seq_len(sqrtN*sqrtN))[sample(sqrtN*sqrtN),]
lookupTable$v2  <- seq_len(sqrtN*sqrtN)

lookMeUp <- rbind(lookupTable[sample(nrow(lookupTable), 10), 1:2], c(-1, -1))
lookMeUp$vA <- letters[1:nrow(lookMeUp)]

Timings of lookuptable with 5e7 rows:具有 5e7 行的查找表的时间:

sqrtN  <- 7.1e3
lookupTable <- data.frame(expand.grid(i1=seq_len(sqrtN), i2=seq_len(sqrtN)), v1=seq_len(sqrtN*sqrtN))[sample(sqrtN*sqrtN),]
lookupTable$v2  <- seq_len(sqrtN*sqrtN)

lookMeUp <- rbind(lookupTable[sample(nrow(lookupTable), 10), 1:2], c(-1, -1))
lookMeUp$vA <- letters[1:nrow(lookMeUp)]

system.time({maxLTi1 <- max(lookupTable$i1); lookupTable$id <- lookupTable$i1 + lookupTable$i2 * maxLTi1})
#       User      System verstrichen 
#      0.312       0.016       0.329 

system.time(lookupTable <- lookupTable[order(lookupTable$id),]) #For findIntervall
#       User      System verstrichen 
#      6.786       0.120       6.905 

system.time({
i <- lookMeUp$i1 + lookMeUp$i2 * maxLTi1
j  <- findInterval(i, lookupTable$id)
j[j==0]  <- NA
j[i != lookupTable$id[j]] <- NA
cbind(lookMeUp, lookupTable[j, 3:4])
})
#       User      System verstrichen 
#      0.099       0.048       0.147 

system.time(fmatch(c(lookupTable$id[1], 0), lookupTable$id)) #Create Hash
#       User      System verstrichen 
#      2.642       0.120       2.762 

system.time(cbind(lookMeUp, lookupTable[fmatch(lookMeUp$i1 + lookMeUp$i2 * maxLTi1, lookupTable$id), 3:4]))
#       User      System verstrichen 
#          0           0           0 

I finally broke down and made this a more generic function:我终于崩溃了,使它成为一个更通用的函数:

set.seed(0); K <- 1000; M <- K*K
rint <- function( n, minv=0, maxv=NA ) sample( minv:(if (is.na(maxv)) n else maxv), n, repl=T )


dict.lookup <- function( dwords, dictionary, by=NULL, by.w=NULL, by.d=NULL ) {
                                        # bad style, but just (mostly symmetric) error checking
    if (is.null(by.d)) by.d <- by; if (is.null(by.w)) by.w <- by
    stopifnot( (!is.null(by.d)) & (!is.null(by.w)))
                                        # valid input checking
    stopifnot( is.data.frame( dwords ) ); stopifnot( is.data.frame( dictionary ) )
    stopifnot( nrow( dwords ) > 0 ); stopifnot( nrow( dictionary ) > 0 )
    stopifnot( is.character(by.w) ); stopifnot( is.character(by.d) )
    stopifnot( length(by.w)==1 ); stopifnot( length(by.d)==1)
    stopifnot( by.w %in% names(dwords) ); stopifnot( by.d %in% names(dictionary) )
                                        # you cannot give the words directly.  hash them first
    stopifnot( is.numeric( dwords[[by.w]] ) )
    stopifnot( is.numeric( dictionary[[by.d]] ) )
                                        # a dictionary should have only unique entries
    stopifnot( anyDuplicated( dictionary[[by.d]] ) == 0 )

                                        # the actual work
    toright <- dictionary[ match(dwords[[by.w]], dictionary[[by.d]]), ]
    cbind(dwords, toright[ , names(toright) != by.d ])
}


L <- 100*M  ## only 100M entries for 1GB*4 of int data
dictionary  <- data.frame( idictwords=sample(1:L), cost2print=rint(L, 1,100),  tiresomeness=rint(L, 100,200) )
message("created dictionary")

## look up 10+1 words
dwords <- data.frame( imywords= c(dictionary[ sample(1:L, 10) , "idictwords"], -99),  frombook=rint(11, 200,300) )
message("created my words")

print( system.time( o <- dict.lookup( dwords, dictionary, by.w= "imywords", by.d= "idictwords" ) ) )
message("looked up my words in dictionary done")

print(o)

gives me给我

   user  system elapsed 
 13.746   0.739  14.489 

         imywords frombook cost2print tiresomeness
68533657 88509161      263         25          110
87030862 75614422      297         23          164
16923080 79185053      249         84          105
62235248 84542527      292         72          141
4044547  35212219      201         13          155
95995528 67895828      257          4          122
43031831 24227004      281         86          101
76602707 55164521      270         52          151
53380001 87665273      207         35          121
24278223 30085231      238          6          153
NA            -99      205         NA           NA

the rownames are the matching rows from the dictionary data frame.行名是字典数据框中的匹配行。

I often tinker with functions (often to get better handling).我经常修改函数(通常是为了更好地处理)。 feel free to suggest changes.随时提出更改建议。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM