More efficient way of using group_by > mutate > slice

I have a dataframe that looks like this

df <- data.frame("Month" = c("April", "April", "May", "May", "June", "June", "June"),
                 "ID" = c(11, 11, 12, 10, 11, 11, 11),
                 "Region" = c("East", "West", "North", "East", "North", "East", "West"),
                 "Qty" = c(120, 110, 110, 110, 100, 90, 70),
                 "Sales" = c(1000, 1100, 900, 1000, 1000, 800, 650),
                 "Leads" = c(10, 12, 9, 8, 6, 5, 4))

Month   ID     Region    Qty    Sales   Leads
April   11     East      120    1000    10
April   11     West      110    1100    12
May     12     North     110    900     9
May     10     East      110    1000    8
June    11     North     100    1000    6
June    11     East      90     800     5
June    11     West      70     650     4

I want a dataframe that looks like this

Month   ID     Qty     Sales   Leads   Region
April   11     230     2100    22      East
May     12     110     900     9       North
May     10     110     1000    8       East
June    11     260     2450    15      North

I am using the following code

result <- df %>% group_by(Month, ID) %>% mutate(across(.cols = Qty:Leads, ~sum(.x, na.rm = T))) %>% slice(n = 1) 

result$Region <- NULL

I have over 2 million such rows and it is taking forever to calculate the aggregate.

I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.

However, I think there could be a more efficient way. Please help on both counts. Can't figure it out for the life of me.

summarize makes more sense to me than mutate and slice. This should save you some time.

library(dplyr)
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups:   Month [3]
#   Month    ID   Qty Sales Leads Region
#   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
# 1 April    11   230  2100    22 East  
# 2 June     11   260  2450    15 North 
# 3 May      10   110  1000     8 East  
# 4 May      12   110   900     9 North 
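
If keeping the original row order matters (the question notes the data frame is arranged in a certain way), one option is to re-arrange the summarised result by the first appearance of each Month/ID pair in the raw data. This is only a sketch on top of the answer above, not part of it:

result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region), .groups = "drop") %>%
  # sort rows back into the order the Month/ID combinations first appear in df
  arrange(match(paste(Month, ID), unique(paste(df$Month, df$ID))))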

solution 1

We can apply generic speed-up strategies:

  1. Do less
  2. Use appropriate data structures

In this example, much of the overhead comes from type checking and grouping inside dplyr. We can rewrite the code slightly to be more efficient by using the collapse package, which provides fast, C/C++-based analogues of the dplyr verbs.

library(collapse)
df |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty),
               Sales = fsum(Sales),
               Leads = fsum(Leads),
               Region = fsubset(Region, 1L),
               keep.group_vars = T) |>
    as_tibble() # optional
#> # A tibble: 4 x 6
#>   Month    ID   Qty Sales Leads Region
#>   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
#> 1 April    11   230  2100    22 East  
#> 2 June     11   260  2450    15 North 
#> 3 May      10   110  1000     8 East  
#> 4 May      12   110   900     9 North 

Here |> is base R's native pipe, which is slightly faster than %>%, and the f-prefixed collapse functions are faster alternatives (implemented in C/C++) to the corresponding dplyr functions.
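
As a possible shorthand, collapse also offers collap(), which aggregates numeric and categorical columns with separate functions in a single call. A minimal sketch, assuming fsum for the numeric columns and ffirst to keep the first Region per group (equivalent in spirit to the fsummarise call above):

library(collapse)
# numeric columns are aggregated with FUN, categorical columns with catFUN
collap(df, ~ Month + ID, FUN = fsum, catFUN = ffirst)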

solution 2

In addition to www's approach, you will likely get a significant speed-up when working with large data frames if you swap to a data.table backend. The easiest way to do this is the dtplyr package, which is installed as part of the tidyverse. We can convert the pipeline by adding two lines of code.

library(dtplyr)
df1 <- lazy_dt(df)
df1 %>%
      group_by(Month, ID) %>%
      summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
                Region = first(Region)) %>%
      as_tibble() # or as.data.table()

Note that this results in an ungrouped data frame at the end.
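
For reference, a rough hand-written data.table equivalent of the pipeline above (an approximation of what dtplyr translates it into; the exact translation can be inspected by calling show_query() on the lazy_dt object):

library(data.table)
dt <- as.data.table(df)
# sum the Qty:Leads columns per Month/ID group and keep the first Region
dt[, c(lapply(.SD, sum, na.rm = TRUE), list(Region = first(Region))),
   by = .(Month, ID), .SDcols = Qty:Leads]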

Benchmarks

Approaches are wrapped in functions for conciseness; dplyr here is www's rewrite of the OP's approach.
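
The wrapper definitions are not shown in the original benchmark; reconstructed from the solutions above (and assuming dplyr, collapse and dtplyr are loaded as before), they could look roughly like this:

collapse <- function(df) {
  df |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty), Sales = fsum(Sales), Leads = fsum(Leads),
               Region = fsubset(Region, 1L), keep.group_vars = TRUE)
}

dplyr <- function(df) {
  df %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region))
}

dtplyr <- function(df) {
  lazy_dt(df) %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region)) %>%
    as_tibble()
}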

bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
            time_unit = "ms", iterations = 200)[c(3,5,7)]
   median mem_alloc n_itr
   <dbl> <bch:byt> <int>
1  0.327        0B   199
2  5.51     8.73KB   196
3  7.21    75.89KB   196

We can see that collapse is more memory efficient and significantly faster than dplyr. The dtplyr approach is included here because its scaling behaviour differs from that of dplyr.

Further optimizations are still possible, for example by converting the numeric columns to integer columns.
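
For instance, assuming Qty, Sales and Leads only ever hold whole numbers, the conversion could be as simple as:

# convert the measure columns to integer before grouping (assumes whole numbers)
df[c("Qty", "Sales", "Leads")] <- lapply(df[c("Qty", "Sales", "Leads")], as.integer)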

side-note

Note that the Region column loses information under the OP's required approach: keeping only East for April and North for June for ID 11 can be misleading.
