rowwise operation with dplyr

I am working on a large dataframe in R of 2.3 million records that contain transactions of users at locations, with start and stop times. My goal is to create a new dataframe that contains the amount of time connected per user, per location. Let's call this hourly connected.
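To illustrate the goal (with hypothetical values), a single session record expands into one record per connected hour:

# One session:
#   userID postalcode               start                 end
#       42         17 2015-01-01 08:10:00 2015-01-01 10:05:00
#
# becomes one record per connected hour:
#   userID hourlydate hournr
#       42 2015-01-01     08
#       42 2015-01-01     09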

Transactions can last anywhere from 8 minutes to 48 hours, so the goal dataframe will be around 100 million records and will grow each month.

The code below shows how the final dataframe is developed, although the total code is much more complex. Running the total code takes ~9 hours on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 128GB RAM.

library(dplyr)

numsessions <- 1000000
# simulate session start times spread across one year
startdate <- as.POSIXlt(runif(numsessions, 1, 365 * 60 * 60) * 24, origin = "2015-1-1")

df.Sessions <- data.frame(userID     = round(runif(numsessions, 1, 500)),
                          postalcode = round(runif(numsessions, 1, 100)),
                          daynr      = format(startdate, "%w"),
                          start      = startdate,
                          end        = startdate + runif(numsessions, 1, 60 * 60 * 10))  # per-session duration


dfhourly.connected <- df.Sessions %>%
  rowwise %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr     = format(seq(.$start, .$end, by = 60 * 60), "%H")))

We want to parallelize this procedure over (some of) the 16 cores to speed it up. A first attempt was to use the multidplyr package. The partition is made based on daynr:

df.hourlyconnected <- df.Sessions %>%
  partition(daynr, cluster = init_cluster(6)) %>%
  rowwise %>%
  do(data.frame(userID     = .$userID,
                hourlydate = as.Date(seq(.$start, .$end, by = 60 * 60)),
                hournr     = format(seq(.$start, .$end, by = 60 * 60), "%H"))) %>%
  collect()

Now, the rowwise function appears to require a dataframe as input instead of a partition.

My questions are:

  • Is there a workaround to perform a rowwise calculation on partitions per core?

  • Has anyone got a suggestion to perform this calculation with a different R package and methods?

(I think posting this as an answer could benefit future readers who have an interest in efficient coding.)


R is a vectorized language, thus operations by row are among the most costly operations; especially if you are evaluating lots of functions, dispatching methods, converting classes and creating new data sets while you're at it.

Hence, the first step is to reduce the "by" operations. Looking at your code, it seems that you are enlarging the size of your data set according to userID, start and end; all the rest of the operations can come afterwards (and hence be vectorized). Also, running seq (which isn't a very efficient function by itself) twice per row adds nothing. Lastly, calling seq.POSIXt explicitly on a POSIXt class will save you the overhead of method dispatch.
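As a quick illustration of that dispatch overhead, here is a minimal sketch using a single made-up start/end pair (timings will vary by machine), comparing the generic seq with a direct seq.POSIXt call:

library(microbenchmark)

s <- as.POSIXct("2015-01-01 00:00:00")
e <- s + 60 * 60 * 10

microbenchmark(generic  = seq(s, e, by = 3600),        # dispatches to seq.POSIXt
               explicit = seq.POSIXt(s, e, by = 3600)) # calls the method directly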

I'm not sure how to do this efficiently with dplyr, because mutate can't handle it and the do function (IIRC) has always proved itself to be highly inefficient. Hence, let's try the data.table package, which can handle this task easily:

library(data.table)
# one seq.POSIXt call per (userID, start, end) group, replacing the rowwise do()
res <- setDT(df.Sessions)[, seq.POSIXt(start, end, by = 3600), by = .(userID, start, end)]

Again, please note that I minimized the "by row" operations to a single function call while avoiding method dispatch.
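If you want to inspect the intermediate result, note that V1 is the default name data.table assigns to the unnamed column returned by seq.POSIXt:

head(res)  # columns: userID, start, end, V1 (the hourly timestamps)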


Now that we have the data set ready, we don't need any by-row operations any more; everything can be vectorized from now on.

Vectorizing isn't the end of the story, though. We also need to take into consideration class conversions, method dispatch, etc. For instance, we could create both the hourlydate and hournr columns using different Date class functions, using format, or maybe even substr. The trade-off to take into account is that, for instance, substr will be the fastest, but the result will be a character vector rather than a Date one; it's up to you to decide whether you prefer the speed or the quality of the end product. Sometimes you can win both, but first you should check your options. Let's benchmark 3 different vectorized ways of calculating the hournr variable:

library(microbenchmark)
set.seed(123)
N <- 1e5
test <- as.POSIXlt(runif(N, 1, 1e5), origin = "1900-01-01")

microbenchmark("format" = format(test, "%H"),
               "substr" = substr(test, 12L, 13L),
               "data.table::hour" = hour(test))

# Unit: microseconds
#             expr        min         lq        mean    median        uq       max neval cld
#           format 273874.784 274587.880 282486.6262 275301.78 286573.71 384505.88   100  b 
#           substr 486545.261 503713.314 529191.1582 514249.91 528172.32 667254.27   100   c
# data.table::hour      5.121      7.681     23.9746     27.84     33.44     55.36   100 a  

data.table::hour is the clear winner in both speed and quality (the results are in an integer vector rather than a character one), improving the speed of your previous solution by a factor of ~12,000 (and I haven't even tested it against your by-row implementation).

Now let's try 3 different ways of calculating the hourlydate variable:

microbenchmark("as.Date" = as.Date(test), 
               "substr" = substr(test, 1L, 10L),
               "data.table::as.IDate" = as.IDate(test))

# Unit: milliseconds
#                 expr       min        lq      mean    median        uq       max neval cld
#              as.Date  19.56285  20.09563  23.77035  20.63049  21.16888  50.04565   100  a 
#               substr 492.61257 508.98049 525.09147 515.58955 525.20586 663.96895   100   b
# data.table::as.IDate  19.91964  20.44250  27.50989  21.34551  31.79939 145.65133   100  a 

It seems like the first and third options are pretty much the same speed-wise, but I prefer as.IDate because of its integer storage mode.
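A small check (not part of the benchmark above) that shows the storage difference:

typeof(as.IDate("2015-01-01"))  # "integer"
typeof(as.Date("2015-01-01"))   # "double"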


Now that we know where both efficiency and quality lie, we can simply finish the task by running:

res[, `:=`(hourlydate = as.IDate(V1), hournr = hour(V1))]

(You can then easily remove the unnecessary columns using similar syntax, res[, yourcolname := NULL], which I'll leave to you.)
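For instance, assuming you only want the three columns from your original goal (the exact column choice is up to you), one possible cleanup is:

res[, c("start", "end", "V1") := NULL]  # keeps userID, hourlydate, hournr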


There are probably more efficient ways of solving this, but this demonstrates one possible way of making your code more efficient.

As a side note, if you want to further investigate data.table syntax/features, here's a good read:

https://github.com/Rdatatable/data.table/wiki/Getting-started
