简体   繁体   中英

Is there a way in data.table to collapse/summarize variables using objects, a la .SDcols?

I have a dataset with multiple variables that I want to collapse/summarize (depending on whether your background is more Stata or tidyverse) using data.table without having to name each individual variable in the code that does this.

Here's some example data:

library(data.table)

dt <- data.table(v1 = c(1,2,5,8,5,9, NA),
                 v2 = c(5,3,6,1, NA,7,8),
                 year = c(1,1,2,3,3,3,4))

dt
    v1 v2 year
1:  1  5    1
2:  2  3    1
3:  5  6    2
4:  8  1    3
5:  5 NA    3
6:  9  7    3
7: NA  8    4

And here's the final dataset I want:

# this is the final data we want:
dt[, .(newv1 = sum(v1, na.rm = TRUE),
       newv2 = sum(v2, na.rm = TRUE)),
   by = 'year']

    year newv1 newv2
1:    1     3     8
2:    2     5     6
3:    3    22     8
4:    4     0     8

The actual dataset has many variables I'd like to summarize, so I want to name the variables in a list outside of the collapsing beforehand and then name them in a systematic way, such as:

# but we want to do it with objects e.g.:
vars.to.collapse <- c('v1', 'v2')
new.v.names <- paste0('new', vars.to.collapse)

new.v.names
[1] "newv1" "newv2"

I know you can do this kind of thing--that is, create multiple variables using objects--using .SD when you're adding/modifying variables using the := operator, but I have not been able to find a way to do this while altering the unit of observation. In an ideal world I'd use code something like the following:

# want something like this but it doesn't work:
newdt <- dt[, .( (new.v.names = sum(.SD, na.rm = TRUE))),
              .SDcols = vars.to.collapse,
              by = 'year']

# not what I want:
newdt
     year V1
1:    1 11
2:    2 11
3:    3 30
4:    4  8

But this doesn't produce the dataset I showed you above with the new variable names and all. I could work around this by using the := operator to add these statistics to the dataset and then drop duplicates or something, but I'd prefer to summarize directly if possible.

To follow up on your comment about naming new variables, I suggest two solutions using the data.table library.

So, please find below the two reprex.

Reprex 1 (as a follow-up to the solution proposed by @Roland)

  • Code
library(data.table)

vars.to.collapse <- c('v1', 'v2') # your code
new.v.names <- paste0('new', vars.to.collapse) # your code

dt <- dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'] # Roland's code

setnames(dt, c("year", new.v.names))
  • Output
dt
#>     year newv1 newv2
#> 1:     1     3     8
#> 2:     2     5     6
#> 3:     3    22     8
#> 4:     4     0     8

Reprex 2

  • Code
library(data.table)

vars.to.collapse <- c('v1', 'v2')  # your code
new.v.names <- paste0('new', vars.to.collapse) # your code

dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = vars.to.collapse, by = 'year'
   ][, (new.v.names) := .SD, .SDcols = vars.to.collapse
     ][, .SD, .SDcols = !patterns("^v")][]
  • Output
#>    year newv1 newv2
#> 1:    1     3     8
#> 2:    2     5     6
#> 3:    3    22     8
#> 4:    4     0     8

Created on 2021-11-17 by the reprex package (v2.0.1)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM