What's the fastest way to subset a data.table?

Question

It seems to me the fastest way to do a row/col subset of a data.table is to use the join and nomatch option.

Is this correct?

DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user  system elapsed 
# 0.00    0.00    0.01 
system.time(subset(DT, (V1==22) & (V2==2)))
# user  system elapsed 
# 0.45    0.21    0.67 

identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE

I also have one problem with the fast join based on binary search: I cannot find a way to select all items in one dimension.

Say if I want to subsequently do:

DT[J(22,2), nomatch=0]  # subset on TWO dimensions
DT[J(22,), nomatch=0]   # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty

without having to re-set the key to only one dimension (because I am in a loop and I don't want to rest the keys every time).

Answer 1

What's the fastest way to subset a `data.table` ?

Using the binary search based subset feature is the fastest. Note that the subset requires the option nomatch = 0L so as to return only the matching results.

How to subset by one of the keys only with two keys set?

If you've two keys set on DT and you want to subset by the first key , then you can just provide the first value in J(.) , no need to provide anything for the 2nd key. That is:

# will return all columns where the first key column matches 22
DT[J(22), nomatch=0L]

If instead, you would like to subset by the second key , then you'll have to, as of now, provide all the unique values for the first key. That is:

# will return all columns where 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]

This is also shown in this SO post . Although I'd prefer that DT[J(, 2)] to work for this case, as that seems rather intuitive.

There's also a pending feature request, FR #1007 for implementing secondary keys, which when done would take care of this.

Here is a better example:

DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
#    V1 V2
# 1:  1  2
# 2:  2  3
# 3:  3  2
# 4:  4  3
# 5:  5  2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
#    V1 V2
# 1:  1  2
# 2:  2  2
# 3:  3  2
# 4:  4  2
# 5:  5  2
DT[J(unique(V1),2), nomatch=0L]
#    V1 V2
# 1:  1  2
# 2:  3  2
# 3:  5  2
DT[J(3), nomatch=0L]
#    V1 V2
# 1:  3  2

In summary:

# key(DT) = c("V1", "V2")

# data.frame                        |             data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) |            DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3))             |              DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2))             |  DT[J(unique(V1), 2), nomatch=0L]

What's the fastest way to subset a data.table?

Question

1 answers

solution1
12 ACCPTED 2014-05-20 09:53:28

What's the fastest way to subset a `data.table` ?

How to subset by one of the keys only with two keys set?

What's the fastest way to subset a data.table?

Question

1 answers

solution1 12 ACCPTED 2014-05-20 09:53:28

What's the fastest way to subset a data.table ?

How to subset by one of the keys only with two keys set?

solution1
12 ACCPTED 2014-05-20 09:53:28

What's the fastest way to subset a `data.table` ?