
Is there a speedier way to nest data by-group?

I'm looking for a fast way to summarize a dataset:

  • by-group
  • collapsing rows by nesting values

For example, consider the mpg dataset from the ggplot2 package. Here we group by manufacturer and year, and nest the cty values.

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

mpg %>%
  group_by(manufacturer, year) %>%
  summarise(nest_cty = list(cty))
#> `summarise()` has grouped output by 'manufacturer'. You can override using the `.groups` argument.
#> # A tibble: 30 x 3
#> # Groups:   manufacturer [15]
#>    manufacturer  year nest_cty  
#>    <chr>        <int> <list>    
#>  1 audi          1999 <int [9]> 
#>  2 audi          2008 <int [9]> 
#>  3 chevrolet     1999 <int [7]> 
#>  4 chevrolet     2008 <int [12]>
#>  5 dodge         1999 <int [16]>
#>  6 dodge         2008 <int [21]>
#>  7 ford          1999 <int [15]>
#>  8 ford          2008 <int [10]>
#>  9 honda         1999 <int [5]> 
#> 10 honda         2008 <int [4]> 
#> # ... with 20 more rows
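For reference, the same nesting can be written with tidyr::chop(), which nests the specified column while grouping by all remaining columns. A minimal sketch (typically no faster than the alternatives below):

library(tidyr)

mpg %>%
  select(manufacturer, year, cty) %>%
  chop(cty)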

There are alternative packages that can give us some performance gain in speed:

  • data.table

     library(ggplot2)
     library(data.table)

     MPG <- data.table(mpg)
     MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)]
  • collapse

     library(ggplot2)
     library(collapse, warn.conflicts = FALSE)
     #> collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`
     #> Note: stats::D -> D.expression, D.call, D.name
     library(magrittr)

     mpg %>%
       fgroup_by(manufacturer, year) %>%
       fsummarise(nest_cty = list(cty))

Evidently, collapse gives the speediest execution time:

bench::mark(dplyr = mpg %>% group_by(manufacturer, year) %>% summarise(nest_cty = list(cty)),
            data.table = {MPG <- data.table(mpg); MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)] },
            collapse = mpg %>% fgroup_by(manufacturer, year) %>% fsummarise(nest_cty = list(cty)),
            check = FALSE)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        4.69ms    5.5ms      184.    2.38MB     5.56
#> 2 data.table   2.37ms   2.51ms      391.    2.16MB     0   
#> 3 collapse     95.2us  101.8us     9560.  206.56KB     6.22
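(check = FALSE is needed above because the results differ in class: tibble versus data.table. As a sanity check, a quick sketch below compares the nested values themselves; it assumes both back-ends emit the groups in the same order, which may not hold in general, in which case sort both results first.)

library(data.table)

res_dplyr <- mpg %>%
  group_by(manufacturer, year) %>%
  summarise(nest_cty = list(cty), .groups = "drop")

MPG    <- data.table(mpg)
res_dt <- MPG[, .(nest_cty = list(cty)), by = list(manufacturer, year)]

# compare the list-columns element-wise
identical(res_dplyr$nest_cty, res_dt$nest_cty)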

The problem

When we have very large data, even collapse is not as speedy as I wish.
Below I simulate some large data and show timings.

Code for the data simulation function (no need to read):

library(ids)
library(tidyr)

simulate_data <- function() {
  
  # util function: generate n random hex ids, re-drawing any duplicates until all are unique
  generate_random_unique_ids <- function(n) {
    vec_ids <- ids::random_id(n = n, bytes = 4, use_openssl = FALSE)
    repeat {
      duplicates <- duplicated(vec_ids)
      if (!any(duplicates)) {
        break
      }
      vec_ids[duplicates] <- ids::random_id(n = sum(duplicates), bytes = 4, use_openssl = FALSE)
    }
    vec_ids
  }
  
  ids        <- generate_random_unique_ids(4e5)
  years      <- 1996:2022
  
  d         <- tidyr::expand_grid(ids, years)         # one row per id-year combination
  d$weights <- sample(5, nrow(d), replace = TRUE)
  d         <- tidyr::uncount(d, weights = weights)   # duplicate each row 1-5 times
  d$vals    <- sample(letters, nrow(d), replace = TRUE)
  d         <- d[-sample(nrow(d), 0.5 * nrow(d), replace = TRUE), ]  # drop a random subset of rows
  
  return(d)
}

Simulate the data:

set.seed(2021)
my_df <- simulate_data()
my_df
#> # A tibble: 19,650,398 x 3
#>    ids      years vals 
#>    <chr>    <int> <chr>
#>  1 86a21d40  1996 c    
#>  2 86a21d40  1996 h    
#>  3 86a21d40  1996 v    
#>  4 86a21d40  1997 m    
#>  5 86a21d40  1997 f    
#>  6 86a21d40  1997 g    
#>  7 86a21d40  1997 b    
#>  8 86a21d40  1998 i    
#>  9 86a21d40  1998 d    
#> 10 86a21d40  1998 j    
#> # ... with 19,650,388 more rows

We can see that my_df contains 3 columns in long format.


Bottom line:
Even with collapse, grouping by ids and years and then nesting vals takes quite some time.

library(collapse)
library(tictoc)

tic()
output <-
  my_df %>%
  fgroup_by(ids, years) %>%
  fsummarise(nested_vals = list(vals))
toc()
#> 20.15 sec elapsed
output
#> # A tibble: 9,411,683 x 3
#>    ids      years nested_vals
#>    <chr>    <int> <list>     
#>  1 00000466  1996 <chr [3]>  
#>  2 00000466  1997 <chr [1]>  
#>  3 00000466  1998 <chr [1]>  
#>  4 00000466  1999 <chr [1]>  
#>  5 00000466  2001 <chr [1]>  
#>  6 00000466  2002 <chr [4]>  
#>  7 00000466  2003 <chr [1]>  
#>  8 00000466  2004 <chr [3]>  
#>  9 00000466  2005 <chr [4]>  
#> 10 00000466  2006 <chr [1]>  
#> # ... with 9,411,673 more rows

I tend to work with very large data, and processing times easily climb for such a "nesting" procedure. Is there possibly a speedier method than the collapse example?

An option with collap

library(collapse)
output <- collap(my_df, ~ ids + years, custom = list(list = 3))
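A minimal sketch (assuming the simulated my_df from above is still in memory) to time collap() against the fsummarise() call from the question. Here custom = list(list = 3) applies base list() to the third column (vals); the resulting column name differs from nested_vals, hence check = FALSE:

library(collapse)
library(magrittr)
library(bench)

bench::mark(
  fsummarise = my_df %>% fgroup_by(ids, years) %>% fsummarise(nested_vals = list(vals)),
  collap     = collap(my_df, ~ ids + years, custom = list(list = 3)),
  check = FALSE  # column names differ between the two outputs
)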
