I have a dataframe that looks like this
df <- data.frame("Month" = c("April", "April", "May", "May", "June", "June", "June"),
                 "ID" = c(11, 11, 12, 10, 11, 11, 11),
                 "Region" = c("East", "West", "North", "East", "North", "East", "West"),
                 "Qty" = c(120, 110, 110, 110, 100, 90, 70),
                 "Sales" = c(1000, 1100, 900, 1000, 1000, 800, 650),
                 "Leads" = c(10, 12, 9, 8, 6, 5, 4))
Month ID Region Qty Sales Leads
April 11 East   120  1000    10
April 11 West   110  1100    12
May   12 North  110   900     9
May   10 East   110  1000     8
June  11 North  100  1000     6
June  11 East    90   800     5
June  11 West    70   650     4
I want a dataframe that looks like this
Month ID Qty Sales Leads Region
April 11 230  2100    22 East
May   12 110   900     9 North
May   10 110  1000     8 East
June  11 260  2450    15 North
I am using the following code
library(dplyr)

result <- df %>%
  group_by(Month, ID) %>%
  mutate(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE))) %>%
  slice(1)  # keep the first row of each group
result$Region <- NULL
I have over 2 million such rows and it is taking forever to calculate the aggregate.
I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.
However, I think there could be a more efficient way. Please help on both counts. Can't figure it out for the life of me.
summarize makes more sense to me than mutate and slice. This should save you some time.
library(dplyr)

result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups:   Month [3]
#   Month    ID   Qty Sales Leads Region
#   <chr> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 April    11   230  2100    22 East
# 2 June     11   260  2450    15 North
# 3 May      10   110  1000     8 East
# 4 May      12   110   900     9 North
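A side note: the result above is still grouped by Month. If that residual grouping is not wanted downstream, summarize's .groups argument can drop it; a small variation on the code above:

result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region),
            .groups = "drop")  # return an ungrouped tibble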
We can apply generic speed-up strategies. In this example, there is a lot of overhead from type checking. We could rewrite the code slightly to be more efficient by using the collapse package, which provides fast C/C++-based analogues of dplyr functions.
library(collapse)

df |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty),
             Sales = fsum(Sales),
             Leads = fsum(Leads),
             Region = fsubset(Region, 1L),
             keep.group_vars = TRUE) |>
  as_tibble()  # optional
#> # A tibble: 4 x 6
#>   Month    ID   Qty Sales Leads Region
#>   <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 April    11   230  2100    22 East
#> 2 June     11   260  2450    15 North
#> 3 May      10   110  1000     8 East
#> 4 May      12   110   900     9 North
Here |> is the base R pipe, which is slightly faster than %>%, and the f-prefixed collapse functions (fgroup_by, fsummarise, fsum) are faster alternatives, implemented in C/C++, to the corresponding dplyr functions.
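As an aside, collapse also exports ffirst(), which may read more naturally than fsubset(Region, 1L) for taking the first value per group; a small variation on the block above, with the same data assumed:

df |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty),
             Sales = fsum(Sales),
             Leads = fsum(Leads),
             Region = ffirst(Region))  # first Region value within each group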
In addition to www's approach, you will likely get a significant speed-up when working with large data.frames if you swap to a data.table backend. The easiest conversion for this would be the dtplyr package, which ships with the tidyverse. We can convert the pipeline by adding two lines of code.
library(dplyr)
library(dtplyr)

df1 <- lazy_dt(df)
df1 %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region)) %>%
  as_tibble()  # or as.data.table()
Note that this results in an ungrouped data.frame at the end.
The approaches are wrapped in functions for conciseness; dplyr here is www's rewrite of the OP's approach.
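For reference, the wrappers might look like this (a sketch assembled from the code above; the function names match the bench::mark() call below, and bench::mark()'s default result checking may need check = FALSE if the three return values differ in class or row order):

library(dplyr)
library(dtplyr)
library(collapse)

dplyr <- function(df) {
  df %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region))
}

collapse <- function(df) {
  df |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty),
               Sales = fsum(Sales),
               Leads = fsum(Leads),
               Region = fsubset(Region, 1L))
}

dtplyr <- function(df) {
  lazy_dt(df) %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region)) %>%
    as_tibble()
}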
bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
            time_unit = "ms", iterations = 200)[c(3, 5, 7)]
  median mem_alloc n_itr
   <dbl> <bch:byt> <int>
1  0.327        0B   199
2   5.51    8.73KB   196
3   7.21   75.89KB   196
We can see that collapse is more memory efficient and significantly faster than dplyr (rows follow the order of the calls in bench::mark(): collapse, dplyr, dtplyr). The dtplyr approach is included here because its time complexity is different from that of dplyr.
Further optimizations are still possible, for example by converting the numeric columns into integer columns.
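For instance, a sketch of that conversion in base R (assuming Qty, Sales, and Leads hold whole numbers, as in the example data):

cols <- c("Qty", "Sales", "Leads")
df[cols] <- lapply(df[cols], as.integer)  # integers use half the memory of doubles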
Note that the Region column loses information under the OP's required approach, i.e., reporting only East for April and only North for June within ID 11 is misleading, since those groups also contain other regions.
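If that matters, one option (a sketch, beyond what the OP asked for) is to keep every region in each group as a single string via base R's toString():

library(dplyr)

df %>%
  group_by(Month, ID) %>%
  summarize(across(Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = toString(unique(Region)),  # e.g. "East, West" for April, ID 11
            .groups = "drop")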