
R - Data.table fast binary search based subset with multiple values in second key

I came across this vignette at https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html#multiple-key-point .

My data looks like this:

ID    TYPE     MEASURE_1    MEASURE_2
1     A        3            3
1     B        4            4
1     C        5            5
1     Mean     4            4
2     A        10           1
2     B        20           2
2     C        30           3
2     Mean     20           2

When I do this, everything works as expected.

setkey(dt, ID, TYPE)
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
dt[.(unique(ID), "B")] # extract SD of all IDs with Type B
dt[.(unique(ID), "C")] # extract SD of all IDs with Type C

Whenever I try something like this, where I want the keyed subset to use multiple values for the second key, I only get the combinations of all unique values of key 1 with the first value in the vector c() given for the second key. That is, only the first value in the vector is used and all following values are ignored.

# extract SD of all IDs with one of the 3 types A/B/C
dt[.(unique(ID), c("A", "B", "C"))]

# previous output is equivalent to 
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A

# I want/expect
dt[TYPE %in% c("A", "B", "C")]

What am I missing here, or is this something I cannot do with keyed subsets?

To clarify: since I cannot leave out key 1 in a keyed subset, the vignette says to include the first key via unique(key1).

Passing multiple values for key 1 also works as expected.

dt[.(c(1, 2), "A")] == dt[ID %in% c(1,2) & TYPE == "A"] # TRUE

The data.table documentation (see help("data.table") or https://rdatatable.gitlab.io/data.table/reference/data.table.html#arguments ) says:

character, list and data.frame input to i is converted into a data.table internally using as.data.table.

So the classical recycling rule used in R (or in data.frame) applies. That is, .(unique(ID), c("A", "B", "C")), which is equivalent to list(unique(ID), c("A", "B", "C")), becomes:

as.data.table(list(unique(ID), c("A", "B", "C")))

Since the length of the longest list element (the length of c("A", "B", "C")) is not a multiple of the length of the shorter one (the length of unique(ID)), you will get an error. If you want each value in unique(ID) combined with each element of c("A", "B", "C"), use CJ(unique(ID), c("A", "B", "C")) instead.

So what you should do is dt[CJ(unique(ID), c("A", "B", "C"))].

Note that dt[.(unique(ID), "A")] works correctly because you passed only one element for the second key, and it gets recycled to match the length of unique(ID).
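As a minimal sketch of the CJ() approach, the table below is rebuilt from the sample data in the question; the names lookup, keyed and scanned are only illustrative. It compares the keyed subset with the vector-scan form dt[TYPE %in% c("A", "B", "C")].

```r
library(data.table)

# Rebuild the example table from the question.
dt <- data.table(
  ID        = rep(1:2, each = 4),
  TYPE      = rep(c("A", "B", "C", "Mean"), times = 2),
  MEASURE_1 = c(3, 4, 5, 4, 10, 20, 30, 20),
  MEASURE_2 = c(3, 4, 5, 4, 1, 2, 3, 2)
)
setkey(dt, ID, TYPE)

# CJ() builds every (ID, TYPE) pair to look up: 2 IDs x 3 types = 6 rows.
lookup <- CJ(unique(dt$ID), c("A", "B", "C"))

keyed   <- dt[lookup]                      # keyed subset on both key columns
scanned <- dt[TYPE %in% c("A", "B", "C")]  # vector-scan form from the question

# Compare values only; the key attribute of the two results may differ.
all.equal(keyed, scanned, check.attributes = FALSE)
```

With many IDs the cross join grows as (number of IDs) x (number of types), so it may be worth checking whether the keyed form is actually faster than the %in% scan on your data.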
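To see the difference in shape, here is a small sketch with a hypothetical ids vector standing in for unique(ID). A length-1 second element is recycled to one row per ID, while CJ() produces the full cross product.

```r
library(data.table)

ids <- c(1L, 2L)  # stand-in for unique(ID)

# Length-1 "A" is recycled: one row per ID.
recycled <- as.data.table(list(ids, "A"))
nrow(recycled)

# CJ() crosses every ID with every type.
crossed <- CJ(ids, c("A", "B", "C"))
nrow(crossed)
```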


