
data.table: How to do binary search for two (numeric) values at one key: example included

The example data:

library(data.table)
DT <- data.table(a = c(1, 3, 5, 9, 15), 
                 b = c("a", "c", "d", "e", "f"))

I would like to get the two rows where a == 3 | a == 9, that is:

# a b
# 3 c
# 9 e

I know that if I do DT[, a := as.character(a)], then setkey(DT, a) and DT[c("3", "9")], I get the wanted result. But I would like to know whether there are other ways to do this kind of binary search without converting a to character in advance.

First, you don't have to convert to a character column before performing a join or binary-search-based subset. You can use J() and pass an integer / numeric / character / logical / bit64::integer64 vector to it, like so:

DT[J(vec1, vec2, ...)]

where vec1 is matched against the first key column, vec2 against the second key column, and so on.
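As a minimal sketch of the multi-column case, assuming a hypothetical table DT2 keyed on two columns (the table and its column names are made up for illustration, they are not part of the question):

```r
library(data.table)

# Hypothetical two-column-key table, for illustration only
DT2 <- data.table(g = c("x", "x", "y", "y"),
                  n = c(1, 2, 1, 2),
                  v = c(10, 20, 30, 40))
setkey(DT2, g, n)

# "y" is matched against the first key column (g),
# 2 against the second key column (n)
res <- DT2[J("y", 2)]
res
```

Each argument to J() lines up positionally with the key columns, so the order of the arguments has to follow the order given to setkey().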

The fact that you don't have to wrap character vectors in J() is an additional feature, just for convenience. Because an integer/numeric/logical vector in i already has a meaning of its own (DT[1] returns the first row), we can't provide the same shortcut for those types. Hope this answers your original question.
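The difference between a bare number and J() can be seen on the question's own data. This is only a small sketch; DTc is a hypothetical character-keyed table added to show the shortcut:

```r
library(data.table)

DT <- data.table(a = c(1, 3, 5, 9, 15),
                 b = c("a", "c", "d", "e", "f"))
setkey(DT, a)

by_position <- DT[3]     # row number 3, i.e. the row with a == 5
by_key      <- DT[J(3)]  # binary search on the key: the row with a == 3

# Hypothetical character-keyed table: here J() may be omitted
DTc <- data.table(k = c("p", "q"), v = 1:2)
setkey(DTc, k)
by_char <- DTc["q"]      # same as DTc[J("q")]
```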

Coming back to your question: to subset column a for the values (3, 9), you can use data.table's binary-search-based subset:

require(data.table)
setkey(DT, a)
DT[J(c(3,9))] ## alternatively DT[.(c(3,9))] in 1.9.4+
#    a b
# 1: 3 c
# 2: 9 e

There are two things to note about fast subsetting with data.table's binary search feature:

  • It requires that you sort your entire data (which may not always be desirable).
  • On very large datasets, rearranging the data in memory can be time-consuming (finding the order itself is usually much cheaper than moving data around).

To address these issues and to provide better functionality, data.table 1.9.4 introduced a new experimental feature, automatic indexing, built on secondary keys (implemented by Matt).

Automatic indexing works as follows: if a secondary key doesn't already exist, one is created on the first run of an expression that data.table (at the moment) knows how to optimise. It computes the order of the column with data.table's fast radix ordering and stores that order as an attribute. The data itself is not reordered at all, unlike with setkey. You can also set the secondary key explicitly using set2key() (renamed setindex() in later releases).

The first time you run the query, the time taken is a) the time to build the secondary key (usually very small) plus b) the time for the query itself. From the second time on, it's just the query time, and that is fast because it uses binary search.

If you then query another column with an expression that data.table knows how to optimise, it will additionally set a secondary key for that column the first time it's run, and so on.
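A minimal sketch of the above on the question's data, written against a current data.table release. It assumes setindex() and indices(), which replaced set2key() and key2() after 1.9.4; whether the first query creates an index automatically depends on the datatable.auto.index option and on the installed version, so no output is claimed for that step:

```r
library(data.table)

DT <- data.table(a = c(1, 3, 5, 9, 15),
                 b = c("a", "c", "d", "e", "f"))

# No key is set. With auto indexing enabled, an optimisable
# expression such as %in% or == may create a secondary index on a.
res <- DT[a %in% c(3, 9)]
indices(DT)   # lists any secondary indices present on DT

# The index can also be created explicitly
setindex(DT, a)
idx <- indices(DT)
```

Because the index lives in an attribute, key(DT) stays NULL and the rows of DT keep their original order.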

There should be no noticeable difference in speed between the two methods once setkey and set2key have been run. See the example below.

The concept of secondary keys can be extended beyond automatic indexing to joins as well. This will speed up data.table joins even further.


Here's an example. I'll use 1.9.5, as Matt's already fixed some bugs in automatic indexing.

require(data.table) ## 1.9.5+
set.seed(45L)
DT = data.table(x=sample(1e3, 5e7, TRUE))[, y := x]
setkey(DT, x)
set2key(DT, y)

Note that after setkey(.) DT will be reordered, by reference. But set2key would just set an attribute, and therefore your data wouldn't be reordered based on y's order.
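The difference in reordering can be checked on a tiny table. This sketch uses two hypothetical three-row tables and setindex(), the successor of set2key() in current releases:

```r
library(data.table)

# Secondary index: the column keeps its original order
DT_idx <- data.table(x = c(3L, 1L, 2L))
setindex(DT_idx, x)
DT_idx$x          # still in the original order
indices(DT_idx)

# Primary key: the table is physically sorted by reference
DT_key <- data.table(x = c(3L, 1L, 2L))
setkey(DT_key, x)
DT_key$x          # now sorted
key(DT_key)
```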

The columns x and y are identical (on purpose). Let's test both:

system.time(DT[J(100L)])   ## on column x, 0.003 seconds
system.time(DT[y == 100L]) ## on column y, 0.003 seconds, uses secondary keys

identical(DT[J(100L)], DT[y==100L]) # [1] TRUE

How much time does it take with a vector scan?

options(datatable.auto.index = FALSE)
system.time(DT[y == 100L]) ## 0.428 seconds

You don't need to convert it into a character vector (although integer would make more sense):

 DT <- data.table(a = c(1, 3, 5, 9, 15), b = c("a", "c", "d", "e", "f"))
 setkey(DT, a)
 DT[J(c(3, 9))]

Moreover, if you have the latest version from CRAN, the second time you use a in i, it will automatically use binary search.
