What is the fastest way to subset a data.table?

Question

It seems to me that the fastest way to do a row/column subset of a data.table is to use a join with the nomatch option.

Is this correct?

DT = data.table(rep(1:100, 100000), rep(1:10, 1000000))
setkey(DT, V1, V2)
system.time(DT[J(22,2), nomatch=0L])
# user  system elapsed 
# 0.00    0.00    0.01 
system.time(subset(DT, (V1==22) & (V2==2)))
# user  system elapsed 
# 0.45    0.21    0.67 

identical(DT[J(22,2), nomatch=0L],subset(DT, (V1==22) & (V2==2)))
# [1] TRUE

I also have one problem with the fast join based on binary search: I cannot find a way to select all items in one dimension.

Say I then want to do:

DT[J(22,2), nomatch=0]  # subset on TWO dimensions
DT[J(22,), nomatch=0]   # subset on ONE dimension only
# Error in list(22, ) : argument 2 is empty

without having to re-set the key to only one dimension (because I am in a loop and I don't want to reset the keys every time).

Answer 1

What's the fastest way to subset a `data.table`?

Subsetting with the binary-search-based join is the fastest. Note that the subset needs the option nomatch = 0L so that only the matching rows are returned.
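To make the role of nomatch concrete, here is a minimal sketch on a small made-up table (the column val and its values are assumptions for illustration, not part of the question's data). It shows that a keyed join returns an NA row for a key combination that does not exist unless nomatch = 0L is given.

```r
library(data.table)

# Small keyed table; V1 and V2 mirror the question, val is a hypothetical payload column.
DT <- data.table(V1 = c(1L, 1L, 2L, 3L),
                 V2 = c(10L, 20L, 10L, 30L),
                 val = c("a", "b", "c", "d"))
setkey(DT, V1, V2)

# Default nomatch: the unmatched key (2, 99) still yields one row, with NA in val.
with_na <- DT[J(2L, 99L)]

# nomatch = 0L drops unmatched rows, so the join behaves like a filter.
no_rows <- DT[J(2L, 99L), nomatch = 0L]

# A key combination that exists is returned either way.
hit <- DT[J(2L, 10L), nomatch = 0L]
```

Without nomatch = 0L the result has one row per requested key whether or not it was found, which is why the comparison against subset() in the question only holds with that option.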

How to subset by only one of the keys when two keys are set?

If you have two keys set on DT and you want to subset by the first key, you can just provide the first value in J(.); there is no need to provide anything for the second key. That is:

# will return all rows where the first key column matches 22
DT[J(22), nomatch=0L]

If instead you would like to subset by the second key, then, as of now, you have to provide all the unique values of the first key. That is:

# will return all rows where the 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]

This is also shown in this SO post. I would prefer DT[J(, 2)] to work for this case, though, as that seems rather intuitive.

There's also a pending feature request, FR #1007, for implementing secondary keys, which when done would take care of this.
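For readers on a more recent data.table, here is a hedged sketch of subsetting on the second column without touching the key. It assumes a version that supports the on= argument (1.9.6 or later, as far as I know), which was not available when this answer was written. It is one possible approach, not the resolution of the feature request itself.

```r
library(data.table)

DT <- data.table(V1 = c(1, 2, 3, 4, 5), V2 = c(2, 3, 2, 3, 2))
setkey(DT, V1, V2)

# Join on V2 only, naming the column in i explicitly; the key on (V1, V2) stays as it is.
by_v2 <- DT[.(V2 = 2), on = "V2", nomatch = 0L]

# The same form works for the first column, independent of key order.
by_v1 <- DT[.(V1 = 3), on = "V1", nomatch = 0L]
```

Because on= names the join column directly, nothing has to be re-keyed inside a loop, which was the concern in the question.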

Here is a better example:

DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
#    V1 V2
# 1:  1  2
# 2:  2  3
# 3:  3  2
# 4:  4  3
# 5:  5  2
setkey(DT,V1,V2)
DT[J(unique(V1),2)]
#    V1 V2
# 1:  1  2
# 2:  2  2
# 3:  3  2
# 4:  4  2
# 5:  5  2
DT[J(unique(V1),2), nomatch=0L]
#    V1 V2
# 1:  1  2
# 2:  3  2
# 3:  5  2
DT[J(3), nomatch=0L]
#    V1 V2
# 1:  3  2

In summary:

# key(DT) = c("V1", "V2")

# data.frame                        |             data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) |            DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3))             |              DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2))             |  DT[J(unique(V1), 2), nomatch=0L]
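The equivalences in the table above can be checked with a short sketch like the following, which reuses the five-row example. The helper same_rows is a hypothetical name introduced here; it compares column values only, since row names and key attributes differ between the two classes.

```r
library(data.table)

DF <- data.frame(V1 = c(1, 2, 3, 4, 5), V2 = c(2, 3, 2, 3, 2))
DT <- as.data.table(DF)
setkey(DT, V1, V2)

# Hypothetical helper: TRUE when both results hold the same values in V1 and V2.
same_rows <- function(df, dt) {
  identical(df$V1, dt$V1) && identical(df$V2, dt$V2)
}

both_keys  <- same_rows(subset(DF, V1 == 3 & V2 == 2), DT[J(3, 2), nomatch = 0L])
first_key  <- same_rows(subset(DF, V1 == 3),           DT[J(3), nomatch = 0L])
second_key <- same_rows(subset(DF, V2 == 2),           DT[J(unique(V1), 2), nomatch = 0L])
```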

Answer 1: score 12, accepted, 2014-05-20 09:53:28