dplyr to data.table to speed up execution time
I am currently dealing with a moderately large dataframe called d.mkt (> 2M rows and 12 columns). As dplyr is too slow when summarise() is combined with group_by_at, I am trying to write an equivalent statement using data.table to speed up the summarise step. However, the situation is somewhat special: the dataframe is grouped with group_by_at by every column except a few, and those excluded columns are exactly the ones being summarised (e.g. X %>% select(-id) %>% group_by_at(vars(-x, -y, -z, -t)) %>% summarise(x = sum(x), y = sum(y), z = sum(z), t = sum(t)) %>% ungroup()).
With that in mind, below is my current attempt, which keeps failing with this error: keyby or by has length (1,1,1,1).
Could someone please let me know how to fix this error?
dplyr code:
d.mkt <- d.mkt %>%
  left_join(codes, by = c('rte_cd', 'cd')) %>%
  mutate(is_valid = replace_na(is_valid, FALSE),
         rte_cd = ifelse(is_valid, rte_cd, 'RC'),
         rte_dsc = ifelse(is_valid, rte_dsc, 'SKIPPED')) %>%
  select(-is_valid) %>%
  group_by_at(vars(-c_rv, -g_rv, -h_rv, -rn)) %>%
  summarise(c_rv = sum(as.numeric(c_rv)),
            g_rv = sum(as.numeric(g_rv)),
            h_rv = sum(as.numeric(h_rv)),
            rn = sum(as.numeric(rn))) %>%
  ungroup()
My attempt at translating the above:
d.mkt <- as.data.table(d.mkt)
d.mkt <- d.mkt[codes, on = c('rte_cd', 'sb_cd'),
               `:=` (is.valid = replace_na(is_valid, FALSE),
                     rte_cd = ifelse(is_valid, rte_cd, 'RC00'),
                     rte_ds = ifelse(is_valid, rte_ds, 'SKIPPED'))]
d.mkt <- d.mkt[, -"is.valid", with=FALSE]
# Error here already, but how do we ungroup a `data.table` though?
d.mkt <- d.mkt[, .(c_rv=sum(c_rv), g_rv=sum(g_rv), h_rv = sum(h_rv), rn = sum(rn)), by = .('prop', 'date')]
Close. Some suggestions/answers.
1. If you are moving to data.table for speed, I suggest using fifelse in lieu of replace_na and ifelse; minor.
2. The canonical way to remove is_valid is d.mkt[, is_valid := NULL].
3. Grouping by "everything except these columns" can be done with setdiff.
4. In data.table, there is no need to "ungroup"; each [-call uses its own grouping. (For that reason, if you have multiple chained [-operations that use the same grouping, it can be useful to store that group as a variable, perhaps index it, and/or combine the whole [-chain into a single call. This is prone to lots of benchmarking discussion outside the scope of what we have here.)
5. Since all four columns get the same treatment, I use lapply(.SD, ..) for a little readability improvement.
This might work:
library(data.table)
setDT(codes) # or use `as.data.table(codes)` below instead
setDT(d.mkt) # ditto
# left join: keep every row of d.mkt, add the matching columns from codes
tmp <- codes[d.mkt, on = .(rte_cd, cd)]
# rows without a match have is_valid = NA; treat them as invalid
tmp[, is_valid := fcoalesce(is_valid, FALSE)]
tmp[, c("rte_cd", "rte_dsc") :=
      list(fifelse(is_valid, rte_cd, "RC"),
           fifelse(is_valid, rte_dsc, "SKIPPED"))]
tmp[, is_valid := NULL]
cols <- c("c_rv", "g_rv", "h_rv", "rn")
# group by every column except the ones being summed
tmp[, lapply(.SD, function(z) sum(as.numeric(z))),
    by = setdiff(names(tmp), cols), .SDcols = cols]
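To check the approach end to end, here is a minimal sketch on toy data. The rows, the column values, and the layout of codes (one row carrying rte_dsc and is_valid) are made up for illustration and are only assumptions about the real tables; prop is borrowed from the question's attempt.

```r
library(data.table)

# Toy stand-ins for the real tables (all values are hypothetical)
d.mkt <- data.table(
  prop   = c("A", "A", "B", "B"),
  rte_cd = c("R1", "R1", "R2", "R2"),
  cd     = c("x", "x", "y", "y"),
  c_rv   = c("1", "2", "3", "4"),  # character on purpose, hence as.numeric()
  g_rv   = c(10, 20, 30, 40),
  h_rv   = c(0, 0, 1, 1),
  rn     = c(1L, 1L, 2L, 2L)
)
codes <- data.table(rte_cd = "R1", cd = "x",
                    rte_dsc = "Route one", is_valid = TRUE)

# Left join: every row of d.mkt is kept; unmatched rows get NA from codes
tmp <- codes[d.mkt, on = c("rte_cd", "cd")]
tmp[, is_valid := fcoalesce(is_valid, FALSE)]
tmp[, c("rte_cd", "rte_dsc") :=
      list(fifelse(is_valid, rte_cd, "RC"),
           fifelse(is_valid, rte_dsc, "SKIPPED"))]
tmp[, is_valid := NULL]

# Sum the four value columns, grouping by all remaining columns
cols <- c("c_rv", "g_rv", "h_rv", "rn")
res <- tmp[, lapply(.SD, function(z) sum(as.numeric(z))),
           by = setdiff(names(tmp), cols), .SDcols = cols]
print(res)
```

On the error in the question: by = .('prop', 'date') passes two length-1 string constants instead of columns, which is what the "keyby or by has length" message complains about. Either by = .(prop, date) with unquoted names or by = c('prop', 'date') as a character vector should avoid it.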