
R - data.table fast binary-search-based subset with multiple values in the second key

I have come across this vignette: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html#multiple-key-point

My data looks like this:

ID    TYPE     MEASURE_1    MEASURE_2
1     A        3            3
1     B        4            4
1     C        5            5
1     Mean     4            4
2     A        10           1
2     B        20           2
2     C        30           3
2     Mean     20           2

When I do this, everything works as expected:

setkey(dt, ID, TYPE)
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
dt[.(unique(ID), "B")] # extract SD of all IDs with Type B
dt[.(unique(ID), "C")] # extract SD of all IDs with Type C

However, when I try to base the keyed subset on multiple values for the second key, as shown next, I only get the combinations of the unique values of key 1 with the first value of the c() vector given for the second key. All following values in the vector are ignored.

# extract SD of all IDs with one of the 3 types A/B/C    
dt[.(unique(ID), c("A", "B", "C"))]

# previous output is equivalent to 
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A

# I want/expect
dt[TYPE %in% c("A", "B", "C")]

What am I missing here, or is this something I cannot do with keyed subsets?

To clarify: since I cannot leave out key 1 in a keyed subset, the vignette calls for including the first key via unique(key1).

Supplying multiple values for key 1 also works as expected:

dt[.(c(1, 2), "A")] == dt[ID %in% c(1,2) & TYPE == "A"] # TRUE

The data.table documentation (see help("data.table") or https://rdatatable.gitlab.io/data.table/reference/data.table.html#arguments) mentions:

character, list and data.frame input to i is converted into a data.table internally using as.data.table.

So the classical recycling rule used in R (or in data.frame) applies. That is, .(unique(ID), c("A", "B", "C")), which is equivalent to list(unique(ID), c("A", "B", "C")), becomes:

as.data.table(list(unique(ID), c("A", "B", "C")))

Since the length of the longest list element (the length of c("A", "B", "C")) is not a multiple of the length of the shorter one (the length of unique(ID)), you will get an error. If you want each value in unique(ID) combined with each element of c("A", "B", "C"), use CJ(unique(ID), c("A", "B", "C")) instead.

So what you should do is dt[CJ(unique(ID), c("A", "B", "C"))].
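
As a rough check of the CJ() approach, the sketch below rebuilds the sample table from the question and compares the keyed cross-join subset with the %in% filter. The column types (integer ID, numeric measures) and the variable names idx, keyed and filtered are assumptions made for illustration only.

```r
library(data.table)

# Sample data reconstructed from the table in the question (types assumed)
dt <- data.table(
  ID        = rep(1:2, each = 4),
  TYPE      = rep(c("A", "B", "C", "Mean"), times = 2),
  MEASURE_1 = c(3, 4, 5, 4, 10, 20, 30, 20),
  MEASURE_2 = c(3, 4, 5, 4, 1, 2, 3, 2)
)
setkey(dt, ID, TYPE)

# CJ() builds every ID x TYPE combination: 2 IDs x 3 types = 6 rows
idx <- CJ(unique(dt$ID), c("A", "B", "C"))

keyed    <- dt[idx]                          # binary-search join on (ID, TYPE)
filtered <- dt[TYPE %in% c("A", "B", "C")]   # vector-scan equivalent

# Both subsets should contain the same six rows
print(all.equal(keyed, filtered, check.attributes = FALSE))
```

One difference to keep in mind: CJ() generates all combinations, including any that do not occur in dt. With the default nomatch, such combinations come back as NA rows; passing nomatch = NULL to the join drops them.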

Note that dt[.(unique(ID), "A")] works correctly because you passed only one element for the second key, and this gets recycled to match the length of unique(ID).
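
The length-1 recycling can be seen directly by converting such a list by hand. This is only a small illustration; the name i_tbl is hypothetical, and V1/V2 are the default column names data.table assigns to an unnamed list.

```r
library(data.table)

# The same conversion data.table applies to a list passed as i:
# the length-1 "A" is repeated once for every ID
i_tbl <- as.data.table(list(c(1, 2), "A"))
print(i_tbl)
```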
