Calculations within subsets of dataframe [R]
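I'm having difficulty with calculations on subsets. I can use ave, tapply, or ddply to get overall statistics on average purchases per customer (a factor), but I can't figure out how to compute per-visit statistics for each customer. The simplified data below illustrates my data and the desired result.
Current data frame (note that visit #1 is the most recent visit):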
customer visit date purchase_amt
sarah 2 2013-08-09 5
sarah 3 2013-07-21 8
sarah 4 2013-06-23 9
sarah 5 2013-06-02 1
sarah 1 2013-08-20 8
henry 1 2013-07-04 4
che 1 2013-08-27 2
che 2 2013-07-27 1
che 3 2013-07-05 8
che 4 2013-06-14 3
dt 3 2013-04-05 9
dt 2 2013-06-07 1
dt 1 2013-07-11 6
These are the results I'm looking for:
customer visit date purchase_amt days_since amt_diff
sarah 2 2013-08-09 5 19 -3
sarah 3 2013-07-21 8 28 -1
sarah 4 2013-06-23 9 21 8
sarah 5 2013-06-02 1 NA NA
sarah 1 2013-08-20 8 11 3
henry 1 2013-07-04 4 NA NA
che 1 2013-08-27 2 31 1
che 2 2013-07-27 1 22 -7
che 3 2013-07-05 8 21 5
che 4 2013-06-14 3 NA NA
dt 3 2013-04-05 9 NA NA
dt 2 2013-06-07 1 63 -8
dt 1 2013-07-11 6 34 5
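In summary, I'd like to find a customer's most recent visit and its attributes, then find the next most recent visit and its attributes, and calculate various statistics between the two. When there is no earlier visit, the result should be NA.
Something like this? Assuming your data is called df: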
library(plyr)
# convert dates to class 'Date'
df$date <- as.Date(df$date)
# order by customer and date
df <- df[order(df$customer, df$date), ]
# or since plyr is loaded anyway:
df <- arrange(df, customer, date)
# per customer, calculate differences in date and purchase, between consecutive visits
# pad differences with a leading NA
df2 <- ddply(.data = df, .variables = .(customer), mutate,
             days_since = c(NA, diff(date)),
             amt_diff = c(NA, diff(purchase_amt)))
df2
# customer visit date purchase_amt days_since amt_diff
# 1 che 4 2013-06-14 3 NA NA
# 2 che 3 2013-07-05 8 21 5
# 3 che 2 2013-07-27 1 22 -7
# 4 che 1 2013-08-27 2 31 1
# 5 dt 3 2013-04-05 9 NA NA
# 6 dt 2 2013-06-07 1 63 -8
# 7 dt 1 2013-07-11 6 34 5
# 8 henry 1 2013-07-04 4 NA NA
# 9 sarah 5 2013-06-02 1 NA NA
# 10 sarah 4 2013-06-23 9 21 8
# 11 sarah 3 2013-07-21 8 28 -1
# 12 sarah 2 2013-08-09 5 19 -3
# 13 sarah 1 2013-08-20 8 11 3
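The padding step, c(NA, diff(x)), is easy to check in isolation. Here is a minimal base R sketch using sarah's visits from the question, already sorted oldest first (the variable names are only for illustration):

```r
# Sarah's visits from the question, sorted oldest first
dates <- as.Date(c("2013-06-02", "2013-06-23", "2013-07-21",
                   "2013-08-09", "2013-08-20"))
amounts <- c(1, 9, 8, 5, 8)

# diff() returns one element fewer than its input ...
raw_diff <- diff(amounts)

# ... so a leading NA marks the first visit, which has no earlier visit
amt_diff <- c(NA, raw_diff)

# for dates, c() also drops the difftime class, leaving a plain number of days
days_since <- c(NA, diff(dates))

print(data.frame(dates, amounts, days_since, amt_diff))
```

The NA ends up on the oldest visit, which is what the question asks for when there is no earlier visit.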
This solution uses only base R and preserves the original order of the input:
# Sort, calculate differences and unsort.
# r is the row indexes to use, order.by is the ordering vector, col is the vector to difference
diffs <- function(r, order.by, col) {
  order.by <- order.by[r]
  col <- col[r]
  o <- order(order.by)
  replace(r, o, c(NA, diff(col[o])))
}
# fun specialized to the arguments after the first, i.e. subsequent arguments curried
curry <- function(fun, ...) function(r) fun(r, ...)
# DF is the input data frame; its date column must be of class 'Date'
ix <- seq_len(nrow(DF))
transform(DF,
  days_since = ave(ix, customer, FUN = curry(diffs, date, date)),
  amt_diff = ave(ix, customer, FUN = curry(diffs, date, purchase_amt))
)
The result is:
customer visit date purchase_amt days_since amt_diff
1 sarah 2 2013-08-09 5 19 -3
2 sarah 3 2013-07-21 8 28 -1
3 sarah 4 2013-06-23 9 21 8
4 sarah 5 2013-06-02 1 NA NA
5 sarah 1 2013-08-20 8 11 3
6 henry 1 2013-07-04 4 NA NA
7 che 1 2013-08-27 2 31 1
8 che 2 2013-07-27 1 22 -7
9 che 3 2013-07-05 8 21 5
10 che 4 2013-06-14 3 NA NA
11 dt 3 2013-04-05 9 NA NA
12 dt 2 2013-06-07 1 63 -8
13 dt 1 2013-07-11 6 34 5
Update: minor improvements to the code.
Here is a data.table solution along the same lines as @Henrik's:
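The sort, difference, unsort idea can be seen on a single customer without the currying. This is a minimal sketch on sarah's rows in their original order; the names d, x, o and out are only for illustration:

```r
# Sarah's rows in the order they appear in the question
d <- as.Date(c("2013-08-09", "2013-07-21", "2013-06-23",
               "2013-06-02", "2013-08-20"))
x <- c(5, 8, 9, 1, 8)

# o lists the row positions from the oldest visit to the newest
o <- order(d)

# difference the sorted values, then write each result back to the
# position its row had originally
out <- rep(NA_real_, length(x))
out[o] <- c(NA, diff(x[o]))

print(out)
```

Assigning through out[o] is what undoes the sort, so the differences line up with the rows as originally given.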
df<-structure(list(customer = structure(c(4L, 4L, 4L, 4L, 4L, 3L,
1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("che", "dt", "henry",
"sarah"), class = "factor"), visit = c(2L, 3L, 4L, 5L, 1L, 1L,
1L, 2L, 3L, 4L, 3L, 2L, 1L), date = structure(c(15926, 15907,
15879, 15858, 15937, 15890, 15944, 15913, 15891, 15870, 15800,
15863, 15897), class = "Date"), purchase_amt = c(5L, 8L, 9L,
1L, 8L, 4L, 2L, 1L, 8L, 3L, 9L, 1L, 6L)), .Names = c("customer",
"visit", "date", "purchase_amt"), row.names = c(NA, -13L), class =
"data.frame")
library(data.table)
df <- data.table(df)
# keyby sorts only by customer, so order by date within customer first;
# otherwise diff() compares visits in the order the rows happen to be stored
df[order(customer, date),
   list(visit = visit, date = date, purchase_amt = purchase_amt,
        days_since = c(NA, diff(date)),
        amt_diff = c(NA, diff(purchase_amt))),
   keyby = "customer"]
customer visit date purchase_amt days_since amt_diff
1: che 4 2013-06-14 3 NA NA
2: che 3 2013-07-05 8 21 5
3: che 2 2013-07-27 1 22 -7
4: che 1 2013-08-27 2 31 1
5: dt 3 2013-04-05 9 NA NA
6: dt 2 2013-06-07 1 63 -8
7: dt 1 2013-07-11 6 34 5
8: henry 1 2013-07-04 4 NA NA
9: sarah 5 2013-06-02 1 NA NA
10: sarah 4 2013-06-23 9 21 8
11: sarah 3 2013-07-21 8 28 -1
12: sarah 2 2013-08-09 5 19 -3
13: sarah 1 2013-08-20 8 11 3
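Since diff() works on the rows in whatever order it receives them, grouped differences depend on how each customer's rows are ordered. A small base R sketch with che's dates from the question, which are stored newest first, shows the effect:

```r
# che's visit dates as stored in the question (newest first)
d <- as.Date(c("2013-08-27", "2013-07-27", "2013-07-05", "2013-06-14"))

# differencing in stored order gives negative gaps
unsorted_gaps <- c(NA, diff(d))

# differencing after sorting oldest first gives days since the previous visit
sorted_gaps <- c(NA, diff(sort(d)))

print(unsorted_gaps)
print(sorted_gaps)
```

Negative values in a days_since column are a quick sign that the rows were not in chronological order within each group when the differences were taken.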