R - data.table fast binary search based subset with multiple values in second key
I have come across this vignette at https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html#multiple-key-point.
My data looks like this:
ID TYPE MEASURE_1 MEASURE_2
1 A 3 3
1 B 4 4
1 C 5 5
1 Mean 4 4
2 A 10 1
2 B 20 2
2 C 30 3
2 Mean 20 2
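For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the column types (integer ID, numeric measures) are an assumption, since the question shows only the printed values.

```r
library(data.table)

# Rebuild the example table; values are copied from the printed data above.
# Column types are assumed (integer ID, numeric measures).
dt <- data.table(
  ID        = rep(1:2, each = 4),
  TYPE      = rep(c("A", "B", "C", "Mean"), times = 2),
  MEASURE_1 = c(3, 4, 5, 4, 10, 20, 30, 20),
  MEASURE_2 = c(3, 4, 5, 4, 1, 2, 3, 2)
)

print(dt)
```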
When I do this, everything works as expected.
setkey(dt, ID, TYPE)
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
dt[.(unique(ID), "B")] # extract SD of all IDs with Type B
dt[.(unique(ID), "C")] # extract SD of all IDs with Type C
Whenever I try something like the following, where I want to base the keyed subset on multiple values for the second key, I only get the combinations of the unique values of key 1 with the first value given in the c() vector for the second key. So it only takes the first value defined in the vector and ignores all following values.
# extract SD of all IDs with one of the 3 types A/B/C
dt[.(unique(ID), c("A", "B", "C"))]
# previous output is equivalent to
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
# I want/expect
dt[TYPE %in% c("A", "B", "C")]
What am I missing here, or is this something I cannot do with keyed subsets?
To clarify: since I cannot leave out key 1 in a keyed subset, the vignette calls for including the first key as unique(key1).
Supplying multiple values for key 1 also works as expected.
dt[.(c(1, 2), "A")] == dt[ID %in% c(1,2) & TYPE == "A"] # TRUE
In the data.table documentation (see help("data.table") or https://rdatatable.gitlab.io/data.table/reference/data.table.html#arguments), it is mentioned:
character, list and data.frame input to i is converted into a data.table internally using as.data.table.
So, the classical recycling rule used in R (or in data.frame) applies. That is, .(unique(ID), c("A", "B", "C")), which is equivalent to list(unique(ID), c("A", "B", "C")), becomes:
as.data.table(list(unique(ID), c("A", "B", "C")))
and since the length of the longest list element (the length of c("A", "B", "C")) is not a multiple of the shorter one (the length of unique(ID)), you will get an error. If you want each value in unique(ID) combined with each element in c("A", "B", "C"), you should use CJ(unique(ID), c("A", "B", "C")) instead.
So what you should do is dt[CJ(unique(ID), c("A", "B", "C"))].
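As a rough check of the CJ() approach, the sketch below rebuilds the question's table (column types are assumed) and compares the keyed join against the %in% subset the question was aiming for. The names types, combos, res_join and res_scan are only illustrative, and unique(dt$ID) is used instead of unique(ID) so the example does not depend on how i is scoped.

```r
library(data.table)

# Example data from the question (column types are an assumption).
dt <- data.table(
  ID        = rep(1:2, each = 4),
  TYPE      = rep(c("A", "B", "C", "Mean"), times = 2),
  MEASURE_1 = c(3, 4, 5, 4, 10, 20, 30, 20),
  MEASURE_2 = c(3, 4, 5, 4, 1, 2, 3, 2)
)
setkey(dt, ID, TYPE)

types <- c("A", "B", "C")  # illustrative name for the wanted second-key values

# CJ() builds every (ID, TYPE) pair: 2 IDs x 3 types = 6 rows.
combos <- CJ(unique(dt$ID), types)

# Keyed join on (ID, TYPE) using all pairs from the cross join.
res_join <- dt[combos]

# Vector-scan subset the question wanted to reproduce.
res_scan <- dt[TYPE %in% types]

print(res_join)
print(res_scan)
```

One thing to keep in mind: if some ID had no row for one of the types, the join would return that pair with NA in the measure columns, whereas the %in% subset would simply omit it. Passing nomatch = NULL to the join drops such unmatched rows.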
Note that dt[.(unique(ID), "A")] works correctly because you passed only one element for the second key, and this gets recycled to match the length of unique(ID).
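The length-1 recycling can be seen directly by doing the conversion by hand. This is a small sketch in which ids is a hypothetical stand-in for unique(ID):

```r
library(data.table)

ids <- c(1L, 2L)  # hypothetical stand-in for unique(ID)

# The single "A" is recycled to the length of ids, giving one row per ID.
i_single <- as.data.table(list(ids, "A"))

print(i_single)
```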