R fast single item lookup from list vs data.table vs hash

One of the problems I often face is needing to look up an arbitrary row from a data.table. Yesterday I was trying to speed up a loop, and using profvis I found that the lookup from the data.table was the most costly part of the loop. I then decided to try to find the fastest way to do a single item lookup in R.
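For context, a minimal sketch of the kind of profvis run described (profvis is the profiling package mentioned above; lookups and test_lookup_dt stand in for the names defined in the test code further down):

library(profvis)
profvis({
  for (lookup in lookups) {
    return_value <- test_lookup_dt[lookup]  # this lookup showed up as the hot spot
  }
})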

The data often takes the form of a data.table with a key column of the character type. The remaining columns are typically numeric values. I tried to create a random table with similar characteristics to what I often deal with, which means >100K rows. I compared the native list, the data.table package, and the hash package. The native list and data.table were comparable for individual item lookup performance. Hash appeared to be two orders of magnitude faster. The tests were made up of 10 sets of 10,000 keys randomly sampled to provide for variance in access behavior. Each lookup method used the same sets of keys.

Ultimately my preference would be either to get the row lookup from data.table to be faster, instead of having to create a hash table of my data, or to establish that it cannot be done and just use the hash package when I have to do fast lookups. I don't know if it would be possible, but could you create a hash table of references to the rows in the data.table to allow for fast lookup using the hash package? I know this type of thing is possible in C++, but to my knowledge R does not allow it due to the lack of pointers.

To summarize: 1) Am I using data.table correctly for the lookups, and is this therefore the speed I should expect for a single row lookup? 2) Would it be possible to create a hash of pointers to the data.table rows to allow for fast lookup that way?

Test System:

Windows 10 Pro x64

R 3.2.2

data.table 1.9.6

hash 2.2.6

Intel Core i7-5600U with 16 GB RAM

Code:

library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
library(data.table)
library(hash)

# Set seed to 42 to ensure repeatability
set.seed(42)

# Setting up test ------

# Generate product ids
product_ids <- as.vector(
  outer(LETTERS[seq(1, 26, 1)],
    outer(outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep=""),
          LETTERS[seq(1, 26, 1)], paste, sep = ""
    ), paste, sep = ""
  )
)

# Create test lookup data
test_lookup_list <- lapply(product_ids, function(id){
  return_list <- list(
    product_id = id,
    val_1 = rnorm(1),
    val_2 = rnorm(1),
    val_3 = rnorm(1),
    val_4 = rnorm(1),
    val_5 = rnorm(1),
    val_6 = rnorm(1),
    val_7 = rnorm(1),
    val_8 = rnorm(1)
  )
  return(return_list)
})

# Set names of items in list
names(test_lookup_list) <- sapply(test_lookup_list, function(elem) elem[['product_id']])

# Create lookup hash
lookup_hash <- hash(names(test_lookup_list), test_lookup_list)

# Create data.table from list and set key of data.table to product_id field
test_lookup_dt <- rbindlist(test_lookup_list)
setkey(test_lookup_dt, product_id)

# Create lookup environment from the list (environments hash their names by default)
test_lookup_env <- list2env(test_lookup_list)

# Generate sample of keys to be used for speed testing
lookup_tests <- lapply(1:10, function(x){
  lookups <- sample(test_lookup_dt$product_id, 10000)
  return(lookups)
})

# Native list timing
native_list_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_list[[lookup]]
    }    
  )
  return(timing[['elapsed']])
})

# Data.table timing
datatable_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_dt[lookup]
    }
  )
  return(timing[['elapsed']])
})


# Hashtable timing
hashtable_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- lookup_hash[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

# Summary of timing results
summary(native_list_timings)
summary(datatable_timings)
summary(hashtable_timings)
summary(environment_timings)

These were the results:

> # Summary of timing results
> summary(native_list_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  35.12   36.20   37.28   37.05   37.71   39.24 
> summary(datatable_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  49.13   51.51   52.64   52.76   54.39   55.13 
> summary(hashtable_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1588  0.1857  0.2107  0.2213  0.2409  0.3258 
> summary(environment_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09322 0.09524 0.10680 0.11850 0.13760 0.17140 

It appears that the hash lookup is approximately two orders of magnitude faster than either the native list or data.table in this particular scenario.

Update: 2015-12-11 3:00 PM PST

I received feedback from Neal Fultz suggesting the use of the native Environment object. Here is the code and result I got:

test_lookup_env <- list2env(test_lookup_list)
# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})
summary(environment_timings)
> summary(environment_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09322 0.09524 0.10680 0.11850 0.13760 0.17140 

Indeed, it appears that Environment is faster for individual item access in this scenario. Thank you, Neal Fultz, for pointing this method out; I appreciate having a more thorough understanding of the object types available in R. My questions still stand: am I using data.table correctly (I expect so, but I am open to critique), and is there a way to provide access to the rows of a data.table using some kind of pointer magic that would give faster individual row access?

Clarification: 2015-12-11 3:52 PM PST

There have been some mentions that my access pattern in the innermost loop of my test is inefficient. I agree. What I am trying to do is emulate as closely as possible the situation that I am dealing with. The loop this actually occurs in does not allow for vectorization, which is why I am not using it. I realize this is not strictly the 'R' way of doing things. The data.table in my code is providing reference information, and I do not necessarily know which row I need until I am inside the loop, which is why I am trying to figure out how to access an individual item as quickly as possible, preferably with the data still stored in a data.table. This is also in part a curiosity question: can it be done?

Update 2: 2015-12-11 4:12 PM PST

I received feedback from @jangrorecki that using Sys.time() is an ineffective means of measuring the performance of a function. I have since revised the code to use system.nanotime() per the suggestion. The original code and timing results above have been updated.

The question still stands: is this the fastest way to do a row lookup of a data.table, and if so, is it possible to create a hash of pointers to the rows for quick lookup? At this point I am most curious how far R can be pushed. As someone who came from C++, this is a fun challenge.

Conclusion

I accepted the answer provided by Neal Fultz because it discussed what I was actually wanting to know. That said, this is not the way data.table was intended to be used, so no one should interpret this to mean data.table is slow; it is actually incredibly fast. This was a very particular use case that I was curious about. My data comes in as a data.table, and I wanted to know if I could get quick row access while leaving it as a data.table. I also wanted to compare the data.table access speed with a hash table, which is what is often used for fast, non-vectorized item lookup.

For a non-vectorized access pattern, you might want to try the builtin environment objects:

require(microbenchmark)

test_lookup_env <- list2env(test_lookup_list)


x <- lookup_tests[[1]][1]
microbenchmark(
    lookup_hash[[x]],
    test_lookup_list[[x]],
    test_lookup_dt[x],
    test_lookup_env[[x]]
)

Here you can see it's even zippier than hash:

Unit: microseconds
                  expr      min        lq       mean    median        uq      max neval
      lookup_hash[[x]]   10.767   12.9070   22.67245   23.2915   26.1710   68.654   100
 test_lookup_list[[x]]  847.700  853.2545  887.55680  863.0060  893.8925 1369.395   100
     test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273   100
  test_lookup_env[[x]]    1.588    1.9450    4.61595    2.5255    6.6430   27.977   100

EDIT:

Stepping through data.table:::`[.data.table` is instructive in seeing why dt is slowing down. When you index with a character and there is a key set, it does quite a bit of bookkeeping, then drops down into bmerge, which is a binary search. Binary search is O(log n) and gets slower as n increases.

Environments, on the other hand, use hashing (by default) and have constant access time with respect to n.
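As a minimal illustration of that point (hash = TRUE is the documented default for new.env, so the list2env call above already produces a hashed environment; pre-sizing with size is just a sketch):

# Build a hashed environment explicitly and look up one name
e_demo <- new.env(hash = TRUE, size = length(product_ids))
assign(product_ids[1], test_lookup_list[[1]], envir = e_demo)
e_demo[[product_ids[1]]]  # constant time with respect to the number of bound names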

To work around this, you can manually build a map and index through it:

x <- lookup_tests[[2]][2]

e <- list2env(setNames(as.list(1:nrow(test_lookup_dt)), test_lookup_dt$product_id))

#example access:
test_lookup_dt[e[[x]], ]
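If that pattern comes up often, a small helper keeps the call sites readable (a hypothetical convenience wrapper, not part of the original approach):

# Hypothetical helper: fetch one row by key via the prebuilt index map
lookup_row <- function(dt, index_env, key) dt[index_env[[key]], ]
lookup_row(test_lookup_dt, e, x)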

However, seeing so much bookkeeping code in the data.table method, I'd try out plain old data.frames as well:

test_lookup_df <- as.data.frame(test_lookup_dt)

rownames(test_lookup_df) <- test_lookup_df$product_id
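With the rownames in place, a single row can be pulled either by character row name or, faster, by the integer index from the map built above; these are the same expressions timed in the benchmark below:

test_lookup_df[x, ]       # character row-name lookup
test_lookup_df[e[[x]], ]  # integer lookup via the prebuilt map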

If we are really paranoid, we could skip the [ methods altogether and lapply over the columns directly.
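For example, row e[[x]] can be extracted as a plain list by subsetting each column vector directly, which is the expression timed below:

# Bypass the data.frame `[` dispatch and index each column directly
row_as_list <- lapply(test_lookup_df, `[`, e[[x]])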

Here are some more timings (from a different machine than above):

> microbenchmark(
+   test_lookup_dt[x,],
+   test_lookup_dt[x],
+   test_lookup_dt[e[[x]],],
+   test_lookup_df[x,],
+   test_lookup_df[e[[x]],],
+   lapply(test_lookup_df, `[`, e[[x]]),
+   lapply(test_lookup_dt, `[`, e[[x]]),
+   lookup_hash[[x]]
+ )
Unit: microseconds
                                expr       min         lq        mean     median         uq       max neval
                 test_lookup_dt[x, ]  1658.585  1688.9495  1992.57340  1758.4085  2466.7120  2895.592   100
                   test_lookup_dt[x]  1652.181  1695.1660  2019.12934  1764.8710  2487.9910  2934.832   100
            test_lookup_dt[e[[x]], ]  1040.869  1123.0320  1356.49050  1280.6670  1390.1075  2247.503   100
                 test_lookup_df[x, ] 17355.734 17538.6355 18325.74549 17676.3340 17987.6635 41450.080   100
            test_lookup_df[e[[x]], ]   128.749   151.0940   190.74834   174.1320   218.6080   366.122   100
 lapply(test_lookup_df, `[`, e[[x]])    18.913    25.0925    44.53464    35.2175    53.6835   146.944   100
 lapply(test_lookup_dt, `[`, e[[x]])    37.483    50.4990    94.87546    81.2200   124.1325   241.637   100
                    lookup_hash[[x]]     6.534    15.3085    39.88912    49.8245    55.5680   145.552   100

Overall, to answer your questions: you are not using data.table "wrong", but you are also not using it in the way it was intended (vectorized access). However, you can manually build a map to index through and get most of the performance back.

The approach you have taken seems to be very inefficient, because you are querying single values from the dataset many times over.

It would be much more efficient to query all of them at once and then just loop over the whole batch, instead of querying them one by one, 1e4 times.
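A minimal sketch of that batched pattern, assuming lookups holds the 1e4 keys as in the benchmark code below: one keyed join fetches every row, and the loop then works on the pre-fetched batch.

batch <- test_lookup_dt[lookups]   # one vectorized keyed lookup instead of 1e4 single ones
for (i in seq_len(nrow(batch))) {
  return_value <- batch$val_1[i]   # per-row work on the pre-fetched batch
}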

See dt2 below for a vectorized approach. Still, it is hard for me to imagine a use case for it.

Another thing: 450K rows of data is quite small for a reasonable benchmark; you may get totally different results for 4M rows or more. In terms of the hash approach, you would probably also hit memory limits sooner.

Additionally, Sys.time() may not be the best way to measure timing; read about the gcFirst argument in ?system.time.
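Concretely, gcFirst defaults to TRUE so that a garbage collection runs before the expression is timed, making results more comparable (a sketch using the names from the code below):

system.time(for (lookup in lookups) test_lookup_list[[lookup]], gcFirst = TRUE)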

Here is the benchmark I've made using the system.nanotime() function from the microbenchmarkCore package.

It is possible to speed up the data.table approach even further by collapsing test_lookup_list into a data.table and performing a merge to test_lookup_dt, but to compare to the hash solution I would also need to preprocess it.

library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
library(data.table)
library(hash)

# Set seed to 42 to ensure repeatability
set.seed(42)

# Setting up test ------

# Generate product ids
product_ids = as.vector(
    outer(LETTERS[seq(1, 26, 1)],
          outer(outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep=""),
                LETTERS[seq(1, 26, 1)], paste, sep = ""
          ), paste, sep = ""
    )
)

# Create test lookup data
test_lookup_list = lapply(product_ids, function(id) list(
    product_id = id,
    val_1 = rnorm(1),
    val_2 = rnorm(1),
    val_3 = rnorm(1),
    val_4 = rnorm(1),
    val_5 = rnorm(1),
    val_6 = rnorm(1),
    val_7 = rnorm(1),
    val_8 = rnorm(1)
))

# Set names of items in list
names(test_lookup_list) = sapply(test_lookup_list, `[[`, "product_id")

# Create lookup hash
lookup_hash = hash(names(test_lookup_list), test_lookup_list)

# Create data.table from list and set key of data.table to product_id field
test_lookup_dt <- rbindlist(test_lookup_list)
setkey(test_lookup_dt, product_id)

# Generate sample of keys to be used for speed testing
lookup_tests = lapply(1:10, function(x) sample(test_lookup_dt$product_id, 1e4))

native = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_list[[lookup]]))
dt1 = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) test_lookup_dt[lookup]))
hash = lapply(lookup_tests, function(lookups) system.nanotime(for(lookup in lookups) lookup_hash[[lookup]]))
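# dt2: vectorized variant -- one keyed join for all 1e4 keys, then grouping by
# row number so each row is still materialized individually as its own .SD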
dt2 = lapply(lookup_tests, function(lookups) system.nanotime(test_lookup_dt[lookups][, .SD, 1:length(product_id)]))

summary(sapply(native, `[[`, 3L))
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  27.65   28.15   28.47   28.97   28.78   33.45
summary(sapply(dt1, `[[`, 3L))
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  15.30   15.73   15.96   15.96   16.29   16.52
summary(sapply(hash, `[[`, 3L))
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.1209  0.1216  0.1221  0.1240  0.1225  0.1426 
summary(sapply(dt2, `[[`, 3L))
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#0.02421 0.02438 0.02445 0.02476 0.02456 0.02779
