
R - data.table not grouping when using with = FALSE

Update: it seems that with = F is incompatible with expressions in j, and also with by = in at least some situations.

Taking the scenario below and simplifying it as much as possible:

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data = c(rep(T, 3), rep(F, 3)))

dt[
  ,
  3,
  with = F,
  by = list(group1, group2)
]

    data
1:  TRUE
2:  TRUE
3:  TRUE
4: FALSE
5: FALSE
6: FALSE

dt[
  ,
  data,
  by = list(group1, group2)
]

   group1 group2  data
1:      a      x  TRUE
2:      a      x  TRUE
3:      a      y  TRUE
4:      b      y FALSE
5:      b      z FALSE
6:      b      z FALSE

The expression behavior is documented in a roundabout way in ?data.table, in the description of j:

A single column name, single expression of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) a vector of names or positions to select.

I don't see anything in the documentation about with = F disabling by =, but in this case it seems to.


I'm having an issue where data.table either uses or ignores by = depending on whether I use with = F.

library(data.table)

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data = c(rep(T, 3), rep(F, 3)))

# without with = F

dt[
  as.vector(!is.na(dt[, 3, with = F])),
  sum(data),
  by = list(group1, group2)
]
   group1 group2 V1
1:      a      x  2
2:      a      y  1
3:      b      y  0
4:      b      z  0 

# with = F

dt[
  as.vector(!is.na(dt[, 3, with = F])),
  sum(3),
  with = F,
  by = list(group1, group2)
]
    data
1:  TRUE
2:  TRUE
3:  TRUE
4: FALSE
5: FALSE
6: FALSE

I've tried using a vector of numbers and a vector of characters for by =; neither works.

sum() is just an example function; I have the same basic issue when I don't apply a function in j.

In the end, I need to use with = F to iterate across multiple columns of the data.table in a for loop.

Any suggestions?

A good rule of thumb for data with named columns: never use column numbers. Columns sometimes get rearranged, and that can leave your code completely broken. Of course every rule of thumb has exceptions, but you'd need to demonstrate that your case is worth one, so I'll assume for now that it isn't.

So, if you're typing the code you'd do:

dt[!is.na(data), sum(data), by = .(group1, group2)]

And if you have the column name instead in a variable, you'd do:

col = "data"
dt[!is.na(get(col)), sum(get(col)), by = .(group1, group2)]
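Since the question mentions iterating over several columns in a for loop, the get() approach above can be extended as in the following sketch. The names cols and results are hypothetical and only serve the illustration; the example table has a single data column, so the vector holds one name here.

```r
library(data.table)

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data = c(rep(TRUE, 3), rep(FALSE, 3)))

# Hypothetical vector of column names to iterate over
cols <- c("data")

# One grouped summary per column, stored in a named list
results <- list()
for (col in cols) {
  results[[col]] <- dt[!is.na(get(col)), sum(get(col)), by = .(group1, group2)]
}

print(results[["data"]])
```

Each list element is an ordinary grouped data.table, so by = is honored and no column positions are needed.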

As for using by together with with = FALSE: that mode is designed for compatibility with data.frame, which has no by argument. Even if by were supported there, the result would be trivial, since in with = FALSE mode the j-expression is always interpreted as a selection of full columns (just as in data.frame).
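To make the distinction concrete, here is a small sketch that keeps the two uses apart: with = FALSE purely for selecting columns, and .SDcols for a grouped computation over columns named in a variable. The variable names cols, sel and agg are assumptions for the example, and using na.rm = TRUE in place of the !is.na() filter is a substitution on my part, not something taken from the question.

```r
library(data.table)

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data = c(rep(TRUE, 3), rep(FALSE, 3)))

# Hypothetical vector of column names held in a variable
cols <- c("data")

# with = FALSE: j is read as column names/positions, so this only selects columns
sel <- dt[, cols, with = FALSE]

# Grouped aggregation over the same columns: .SD is restricted by .SDcols
agg <- dt[, lapply(.SD, sum, na.rm = TRUE), by = .(group1, group2), .SDcols = cols]

print(sel)
print(agg)
```

Here sel contains all six rows of the data column, while agg has one row per (group1, group2) combination.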
