
How to calculate the average difference between dates by ID in R

I have a data set like the one below, and I want to calculate the average time difference for each unique ID.

data:
   membership_id created_date 
1       12000000 2015-01-20   
2       12000001 2012-11-19   
3       12000001 2013-10-07   
4       12000001 2014-03-06   
5       12000001 2015-01-14   
6       12000003 2013-02-08   
7       12000003 2014-03-06
8       12000000 2014-02-05
9       12000000 2012-01-06

From the above data set, I want to calculate the average time difference between consecutive dates for each unique ID.

Tried:

 library(plyr)
 data = data[order(data$membership_id, data$created_date), ]
 result = ddply(data, .(membership_id), summarize, avg = as.numeric(mean(diff(created_date))))

The above code works fine on a small data set, but my data set has 5 million rows. It is taking a very long time: it has been running for the last 6 hours and still has not finished.

Expected output:

  membership_id  avg_time_diff
 1 12000000       76 days
 2 12000001       56 days
 3 12000003       54 days

Coming from plyr, you can probably transition very easily to dplyr. It won't be quite as fast as data.table, but it will be much faster than ddply.

library(dplyr)

# Sort rows by date, then average the gaps between consecutive dates for each ID
dat %>% group_by(membership_id) %>%
    arrange(created_date) %>%
    summarize(avg = as.numeric(mean(diff(created_date))))
# Source: local data frame [3 x 2]
#
#   membership_id   avg
#           (int) (dbl)
# 1      12000000   555
# 2      12000001   262
# 3      12000003   391
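
As a side note that is not part of the approach above: the consecutive differences of sorted dates telescope, so their mean equals (latest date - earliest date) / (n - 1), and the sort can be skipped entirely. The following is only a sketch under that observation; it rebuilds the sample data so it runs on its own, and the names `dat` and `res` are just illustrative.

```r
library(dplyr)

# Sample data from the question, rebuilt here so the sketch is self-contained
dat <- data.frame(
  membership_id = c(12000000L, 12000001L, 12000001L, 12000001L, 12000001L,
                    12000003L, 12000003L, 12000000L, 12000000L),
  created_date = as.Date(c("2015-01-20", "2012-11-19", "2013-10-07",
                           "2014-03-06", "2015-01-14", "2013-02-08",
                           "2014-03-06", "2014-02-05", "2012-01-06"))
)

# mean(diff(sorted x)) == (max(x) - min(x)) / (n - 1), so no sorting is needed.
# IDs with a single row give 0 / 0 = NaN, the same as mean() of an empty diff.
res <- dat %>%
  group_by(membership_id) %>%
  summarize(avg = as.numeric(max(created_date) - min(created_date)) / (n() - 1))

print(res)
```

On the sample data this should agree with the sorted version above. Whether it is noticeably faster on 5 million rows would need to be measured.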

Without any more real effort, you can speed things up even more by converting to a data.table object while still using the dplyr commands. Pure data.table will still be even faster.
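
For reference, a pure data.table version of the same calculation might look roughly like the sketch below. It assumes the column names from the question. The object names `dt` and `res` are illustrative, and the sample data is rebuilt inline so the block runs on its own.

```r
library(data.table)

# Sample data from the question, rebuilt here so the sketch is self-contained
dt <- data.table(
  membership_id = c(12000000L, 12000001L, 12000001L, 12000001L, 12000001L,
                    12000003L, 12000003L, 12000000L, 12000000L),
  created_date = as.Date(c("2015-01-20", "2012-11-19", "2013-10-07",
                           "2014-03-06", "2015-01-14", "2013-02-08",
                           "2014-03-06", "2014-02-05", "2012-01-06"))
)

# Sort by ID and date in place (by reference, without copying the table)
setkey(dt, membership_id, created_date)

# Average gap in days between consecutive dates within each ID
res <- dt[, .(avg = as.numeric(mean(diff(created_date)))), by = membership_id]

print(res)
```

`setkey()` reorders the table by reference, which avoids copying a 5-million-row data set before the grouped step.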

(Using this data)

dat = structure(list(membership_id = c(12000000L, 12000001L, 12000001L, 
12000001L, 12000001L, 12000003L, 12000003L, 12000000L, 12000000L
), created_date = structure(c(16455, 15663, 15985, 16135, 16449, 
15744, 16135, 16106, 15345), class = "Date")), .Names = c("membership_id", 
"created_date"), row.names = c("1", "2", "3", "4", "5", "6", 
"7", "8", "9"), class = "data.frame")
