rowwise operation with dplyr
I am working on a large dataframe in R of 2.3 million records that contains transactions of users at locations, with start and stop times. My goal is to create a new dataframe that contains the amount of time connected per user, per location. Let's call this hourly connected.

Transactions can vary from 8 minutes to 48 hours, so the goal dataframe will be around 100 million records and will grow each month.

The code below shows how the final dataframe is developed, although the total code is much more complex. Running the total code takes ~9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128GB RAM.
library(dplyr)
numsessions <- 1000000
startdate <- as.POSIXlt(runif(numsessions, 1, 365*60*60)*24, origin = "2015-1-1")
df.Sessions <- data.frame(userID = round(runif(numsessions, 1, 500)),
                          postalcode = round(runif(numsessions, 1, 100)),
                          daynr = format(startdate, "%w"),
                          start = startdate,
                          end = startdate + runif(1, 1, 60*60*10))

dfhourly.connected <- df.Sessions %>% rowwise %>%
  do(data.frame(userID = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60*60)),
                hournr = format(seq(.$start, .$end, by = 60*60), "%H")))
We want to parallelize this procedure over (some of) the 16 cores to speed it up. A first attempt was to use the multidplyr package. The partition is made based on daynr:
df.hourlyconnected <- df.Sessions %>%
  partition(daynr, cluster = init_cluster(6)) %>%
  rowwise %>%
  do(data.frame(userID = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60*60)),
                hournr = format(seq(.$start, .$end, by = 60*60), "%H"))) %>%
  collect()
Now, the rowwise function appears to require a dataframe as input instead of a partition.

Is there a workaround to perform a rowwise calculation on partitions, per core? Has anyone got a suggestion for performing this calculation with a different R package and methods?
(I think posting this as an answer could benefit future readers who have an interest in efficient coding.)
R is a vectorized language, so operations by row are among the most costly operations, especially if you are evaluating lots of functions, dispatching methods, converting classes and creating new data sets while you are at it.
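As a tiny illustration of that cost (a toy comparison, not from the question's data), compare one vectorized call against a million per-element function calls doing the same arithmetic:

```r
x <- runif(1e6)
y <- runif(1e6)

# Vectorized: a single call into compiled code
z1 <- x + y

# Per-element: ~1e6 R-level function calls for the same result
z2 <- vapply(seq_along(x), function(i) x[i] + y[i], numeric(1))
```

The per-element version is typically orders of magnitude slower, and a rowwise/do pipeline pays a similar per-row tax.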
Hence, the first step is to reduce the "by" operations. Looking at your code, it seems that you are enlarging the size of your data set according to userID, start and end; all the rest of the operations could come afterwards (and hence be vectorized). Also, running seq (which isn't a very efficient function by itself) twice per row adds nothing. Lastly, calling seq.POSIXt explicitly on a POSIXt object will save you the overhead of method dispatch.
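As a quick base-R illustration of that last point, the generic seq() first has to dispatch to the POSIXt method, while calling the method directly skips that step:

```r
# Hypothetical two timestamps; both calls produce the same hourly sequence
x <- as.POSIXct("2015-01-01 08:00:00", tz = "UTC")
y <- as.POSIXct("2015-01-01 12:00:00", tz = "UTC")

seq(x, y, by = 3600)         # goes through UseMethod("seq") before reaching seq.POSIXt
seq.POSIXt(x, y, by = 3600)  # calls the POSIXt method directly
```

The saving per call is small, but it adds up when the expression is evaluated once per group over millions of rows.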
I'm not sure how to do this efficiently with dplyr, because mutate can't handle it and the do function (IIRC) has always proved itself to be highly inefficient. Hence, let's try the data.table package, which can handle this task easily:
library(data.table)
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]
Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch.

Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from now on.
Though, vectorizing isn't the end of the story. We also need to take class conversions, method dispatch, etc. into consideration. For instance, we can create both hourlydate and hournr using different Date class functions, using format, or maybe even substr. The trade-off to take into account is that, for instance, substr will be the fastest, but the result will be a character vector rather than a Date one; it's up to you to decide whether you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the hournr variable:
library(microbenchmark)
set.seed(123)
N <- 1e5
test <- as.POSIXlt(runif(N, 1, 1e5), origin = "1900-01-01")
microbenchmark("format" = format(test, "%H"),
"substr" = substr(test, 12L, 13L),
"data.table::hour" = hour(test))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# format 273874.784 274587.880 282486.6262 275301.78 286573.71 384505.88 100 b
# substr 486545.261 503713.314 529191.1582 514249.91 528172.32 667254.27 100 c
# data.table::hour 5.121 7.681 23.9746 27.84 33.44 55.36 100 a
data.table::hour is the clear winner on both speed and quality (the results are in an integer vector rather than a character one), while improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).
Now let's try 3 different ways of calculating hourlydate:
microbenchmark("as.Date" = as.Date(test),
"substr" = substr(test, 1L, 10L),
"data.table::as.IDate" = as.IDate(test))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# as.Date 19.56285 20.09563 23.77035 20.63049 21.16888 50.04565 100 a
# substr 492.61257 508.98049 525.09147 515.58955 525.20586 663.96895 100 b
# data.table::as.IDate 19.91964 20.44250 27.50989 21.34551 31.79939 145.65133 100 a
It seems the first and third options are pretty much the same speed-wise, while I prefer as.IDate because of its integer storage mode.
Now that we know where both efficiency and quality lie, we could simply finish the task by running:
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]
(You can then easily remove the unnecessary columns using a similar syntax, res[, yourcolname := NULL], which I'll leave to you.)
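Putting the pieces together, a sketch of the full data.table replacement for the original rowwise pipeline might look like this (V1 is the name data.table assigns to the unnamed seq.POSIXt expression in the j clause):

```r
library(data.table)

# Expand each session to one row per connected hour, grouped by session
setDT(df.Sessions)
res <- df.Sessions[, seq.POSIXt(start, end, by = 3600),
                   by = .(userID, start, end)]

# Vectorized derivation of the two target columns, then drop the intermediate
res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]
res[, V1 := NULL]
```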
There are probably more efficient ways of solving this, but this demonstrates a possible way of making your code more efficient.

As a side note, if you want to investigate data.table syntax/features further, here's a good read:
https://github.com/Rdatatable/data.table/wiki/Getting-started