Speeding up group_by operations in dplyr

I have a lot of groups, and I want to perform grouped operations on them (a highly simplified mutate below).
library(dplyr)

z <- tibble(k1 = rep(seq(1, 600000, 1), 5),
            category = sample.int(2, 3000000, replace = TRUE)) %>%
  arrange(k1, category)

t1 <- z %>%
  group_by(k1) %>%
  mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
  ungroup()
This operation is slow. If I instead do the grouping "manually", it's harder to read and more annoying to write, but much faster (about 20x).

z %>%
  mutate(x = if_else(category == 1 & lead(category) == 2 & k1 == lead(k1), "pie", "monkey"),
         x = if_else(category == 1 & k1 != lead(k1), NA_character_, x))
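To see why the k1 guard is needed in the ungrouped version: without group_by(), lead() reads past the end of each k1 group into the next one. A minimal sketch on a toy frame (the values here are assumed for illustration, not the real data):

```r
library(dplyr)

# Toy frame: two rows of k1 == 1, then one row of k1 == 2 (assumed values)
toy <- tibble(k1 = c(1, 1, 2), category = c(1, 2, 1))

# Ungrouped lead() crosses the k1 boundary, so k1 == lead(k1)
# flags which comparisons stay within the same group.
toy %>% mutate(next_cat = lead(category), same_k1 = k1 == lead(k1))
#> next_cat: 2 1 NA ; same_k1: TRUE FALSE NA
```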
So clearly there is some way to speed this up. Is there a better approach? I tried data.table, but it was still much slower than the manual technique.
zDT <- z %>% data.table::as.data.table()
zDT[, x := if_else(category == 1 & lead(category) == 2, "pie", "monkey"), by = "k1"]
Any suggestions for doing this in a natural, fast way?

We can speed this up without using ifelse:
library(data.table)
> system.time(setDT(z)[, x := c("monkey", "pie")[
     1 + (category == 1 & shift(category, type = "lead") %in% 2)], by = k1])
   user  system elapsed
 18.203   0.146  16.635

> system.time({t1 <- z %>%
    group_by(k1) %>%
    mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
    ungroup()
  })
   user  system elapsed
 37.319   0.321  37.523
EDIT: duckdb wins. It is 10x faster than @akrun's data.table solution, with identical output.

EDIT #2: A small nit in the OP: the lead default was not specified, producing NAs that duckdb replicates, but which are treated as "monkey" in @akrun's data.table answer.
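The nit can be seen on a tiny group (a minimal sketch with assumed values): at the last row of a group, lead(category) is NA, so if_else() propagates NA, whereas the data.table answer's %in% 2 treats that NA as FALSE and returns "monkey".

```r
library(dplyr)

g <- tibble(category = c(1, 2, 1))  # one group; the last row has no successor

g %>% mutate(
  # if_else: TRUE & NA is NA, so the last row stays NA
  dplyr_style = if_else(category == 1 & lead(category) == 2, "pie", "monkey"),
  # %in% maps NA to FALSE, so the last row becomes "monkey"
  dt_style    = c("monkey", "pie")[1 + (category == 1 & lead(category) %in% 2)]
)
#> dplyr_style: "pie" "monkey" NA ; dt_style: "pie" "monkey" "monkey"
```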
Out of benchmarking curiosity, I looked at the duckdb and collapse packages, both of which offer dplyr masking/translation to a faster back end. The collapse version was a bit faster, but duckdb was about 10x faster.
Unit: milliseconds
   expr      min       lq    mean   median       uq     max neval
 duckdb 809.5969 825.1131 851.222 845.6702 868.2173 900.495    10

Unit: seconds
     expr      min       lq     mean   median       uq      max neval
 collapse 8.363416 8.456532 8.633155 8.582542 8.835366 8.926974    10
       dt 9.211959 9.243295 9.330174 9.324183 9.433316 9.457501    10
I split the benchmarking into two parts because it seemed I couldn't have collapse and duckdb both mask dplyr at the same time.

Part 1
library(DBI); library(duckdb)
con <- dbConnect(duckdb())
duckdb_register(con, "z_duck", z)
microbenchmark::microbenchmark(times = 10,
  duckdb = tbl(con, "z_duck") |>
    group_by(k1) |>
    # EDIT: set a default for when lead(category) is NA at the end of a group,
    # to match the data.table answer
    mutate(x = if_else(category == 1 & lead(category, default = 0) == 2,
                       "pie", "monkey")) |>
    ungroup() |>
    collect())
Part 2 (after a fresh restart of R)
library(data.table)
library(collapse)
options(collapse_mask = "all")
microbenchmark::microbenchmark(times = 5,
  collapse = z |>
    group_by(k1) |>
    mutate(x = if_else(category == 1 & lead(category) == 2,
                       "pie", "monkey")) |>
    ungroup() |>
    collect(),
  dt = setDT(z)[, x := c("monkey", "pie")[
    1 + (category == 1 & shift(category, type = "lead") %in% 2)], by = k1]
)
I adjusted the duckdb formula by adding default = 0 to the lead() term to conform with the data.table answer. This confirms the same calculation is happening:
compare = data.frame(k1 = z$k1, category = z$category,
                     dt = dt$x, duckdb = duckdb$x)
compare %>%
  count(duckdb == dt)
#  duckdb == dt       n
#1         TRUE 3000000
Making these grouped comparisons will be relatively expensive. It's better to vectorize over the whole table if possible. Notice that ifelse is faster than if_else, and data.table's shift is faster than lead.
library(data.table)
library(dplyr)

z <- setorder(data.table(k1 = rep(seq(1, 600000, 1), 5),
                         category = sample.int(2, 3000000, replace = TRUE)))
t1 <- copy(z)
t2 <- copy(z)
t3 <- copy(z)
t4 <- copy(z)
t5 <- copy(z)

microbenchmark::microbenchmark(
  if_else = t1[, x := if_else(category == 1L & lead(category) == 2L, "pie", "monkey"), k1],
  ifelse = t2[, x := ifelse(category == 1L & lead(category) == 2L, "pie", "monkey"), k1],
  shift = t3[, x := ifelse(category == 1L & shift(category, -1) == 2L, "pie", "monkey"), k1],
  ifelse3 = t4[, x := ifelse(category == 1L, ifelse(k1 == shift(k1, -1), ifelse(shift(category, -1) == 2L, "pie", "monkey"), NA_character_), "monkey")],
  logic = t5[, x := c("monkey", NA_character_, "monkey", "pie")[((k1 == shift(k1, -1, 0L))*((shift(category, -1, 0L) == 2) + 1L) + 1L)*(category == 1) + 1L]],
  times = 1,
  check = "identical"
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> if_else 25162.7484 25162.7484 25162.7484 25162.7484 25162.7484 25162.7484 1
#> ifelse 18150.7634 18150.7634 18150.7634 18150.7634 18150.7634 18150.7634 1
#> shift 9057.7585 9057.7585 9057.7585 9057.7585 9057.7585 9057.7585 1
#> ifelse3 1544.2912 1544.2912 1544.2912 1544.2912 1544.2912 1544.2912 1
#> logic 81.9844 81.9844 81.9844 81.9844 81.9844 81.9844 1
The complexity of the logic is mostly due to NA behavior. If "monkey" were acceptable in place of NA, t5 could instead be:

t5[, x := c("monkey", "pie")[(category == 1)*(k1 == shift(k1, -1, 0L))*(shift(category, -1, 0L) == 2) + 1L]]
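To unpack the four-element lookup table used in the `logic` benchmark entry: each logical term coerces to 0/1, and the arithmetic combines them into an index between 1 and 4. A base-R sketch with assumed toy values (not the real data):

```r
# Assumed toy values for the three logical ingredients of the index
same_grp  <- c(TRUE, TRUE, FALSE)   # k1 == shift(k1, -1, 0L)
next_is_2 <- c(TRUE, FALSE, TRUE)   # shift(category, -1, 0L) == 2
cat_is_1  <- c(TRUE, TRUE, TRUE)    # category == 1

# Same arithmetic as the `logic` expression: TRUE/FALSE coerce to 1/0,
# so each row lands on one of the four lookup positions
idx <- (same_grp * (next_is_2 + 1L) + 1L) * cat_is_1 + 1L
c("monkey", NA_character_, "monkey", "pie")[idx]
#> "pie" "monkey" NA
```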
One option is to load {dtplyr}, which lets you use dplyr syntax and translates it to data.table syntax. To use {dtplyr}, insert lazy_dt() before the {dplyr} steps and use as_tibble() at the end of the pipe to evaluate the generated data.table code.
However, duckdb is faster; see the comparison at the bottom.

Showing the generated data.table code (not necessary; I'm just in the habit of explaining the process in this answer):
library(dtplyr)
library(dplyr, w = F)

z <- tibble(k1 = rep(seq(1, 600000, 1), 5),
            category = sample.int(2, 3000000, replace = TRUE)) %>%
  arrange(k1, category)

z %>%
  lazy_dt() %>%
  group_by(k1) %>%
  mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
  ungroup() %>%
  show_query()
#> copy(`_DT1`)[, `:=`(x = fifelse(category == 1 & shift(category,
#>     type = "lead") == 2, "pie", "monkey")), by = .(k1)]
Created on 2022-08-12 by the reprex package (v2.0.1.9000)

Comparing timings:
bench::mark(
  duck =
    tbl(con, "z_duck") |>
      group_by(k1) |>
      mutate(x = if_else(category == 1 & lead(category) == 2,
                         "pie", "monkey")) |>
      ungroup() |>
      collect()
  , dt =
    z %>%
      lazy_dt() %>%
      group_by(k1) %>%
      mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
      ungroup() %>%
      as_tibble()
  , dplyr =
    z %>%
      group_by(k1) %>%
      mutate(x = if_else(category == 1 & lead(category) == 2, "pie", "monkey")) %>%
      ungroup()
)
# # A tibble: 3 × 13
# expres…¹ min median itr/s…² mem_a…³ gc/se…⁴ n_itr n_gc total_…⁵ result
# <bch:ex> <bch:tm> <bch:tm> <dbl> <bch:b> <dbl> <int> <dbl> <bch:tm> <list>
# 1 duck 691.13ms 691.13ms 1.45 34.4MB 0 1 0 691.13ms <tibble>
# 2 dt 10.64s 10.64s 0.0939 107.6MB 0.939 1 10 10.64s <tibble>
# 3 dplyr 1.68m 1.68m 0.00995 880.3MB 1.20 1 121 1.68m <tibble>