Calculating difference between dates based on grouping one or more columns
A sample of my dataset is shown below:
| id | Date | Buyer |
|:--:|-----------:|----------|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 4 | 5/30/2018 | Chang |
| 4 | 7/4/2018 | Chang |
| 4 | 8/17/2018 | Chang |
| 5 | 5/25/2018 | Chunfei |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
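I have two sets of questions about this dataset. For the first, I want a `Diff` column holding the number of days since the previous date within each `id` and buyer: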
| id | Date | Buyer | Diff |
|:--:|-----------:|----------|------|
| 9 | 11/29/2018 | Jenny | NA |
| 9 | 11/29/2018 | Jenny | 0 |
| 9 | 11/29/2018 | Jenny | 0 |
| 4 | 5/30/2018 | Chang | NA |
| 4 | 7/4/2018 | Chang | 35 |
| 4 | 8/17/2018 | Chang | 44 |
| 5 | 5/25/2018 | Chunfei | NA |
| 5 | 2/13/2019 | Chunfei | 264 |
| 5 | 2/16/2019 | Chunfei | 3 |
| 5 | 2/16/2019 | Chunfei | 0 |
| 5 | 2/23/2019 | Chunfei | 7 |
| 5 | 2/25/2019 | Chunfei | 2 |
| 8 | 2/28/2019 | Chunfei | NA |
| 8 | 2/28/2019 | Chunfei | 0 |
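The problem is that I don't understand why `group_by` is not working as expected. The code below subtracts consecutive rows across the whole data frame instead of grouping rows with the same buyer and id and then subtracting within each group.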
library(dplyr)

# Build the sample data; every column starts out as character strings
df <- data.frame(
  id = c("9", "9", "9", "4", "4", "4", "5", "5", "5", "5", "5", "5", "8", "8"),
  Date = c("11/29/2018", "11/29/2018", "11/29/2018", "5/30/2018", "7/4/2018",
           "8/17/2018", "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019",
           "2/23/2019", "2/25/2019", "2/28/2019", "2/28/2019"),
  Buyer = c("Jenny", "Jenny", "Jenny", "Chang", "Chang", "Chang", "Chunfei",
            "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei",
            "Chunfei")
)
df$id <- as.numeric(as.character(df$id))
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$Buyer <- as.character(df$Buyer)

# Days since the previous date within each Buyer/id group
df1 <- df %>%
  group_by(Buyer, id) %>%
  mutate(diff = as.numeric(difftime(Date, lag(Date), units = "days")))
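For the second, I want to keep only the rows whose dates fall within 5 days of another row in the same group. The `diff` column can be left out of the final output, which should look like this: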
| id | Date | Buyer |
|----|:----------:|---------:|
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 9 | 11/29/2018 | Jenny |
| 5 | 2/13/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/16/2019 | Chunfei |
| 5 | 2/23/2019 | Chunfei |
| 5 | 2/25/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
| 8 | 2/28/2019 | Chunfei |
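The `group_by()` code in the question is valid dplyr and, run in a fresh session with only dplyr attached, computes differences within each group. When a grouped `mutate()` behaves as if it were ungrouped, a frequent cause is that another package's `mutate()` is masking dplyr's; plyr loaded after dplyr is a common example, though nothing in the question confirms that this happened here. A minimal sketch, assuming masking is the problem, that namespaces each verb explicitly (`df1` and `Diff` are illustrative names):

```r
library(dplyr)

# Sample data rebuilt inline so the block runs on its own
df <- data.frame(
  id = rep(c(9, 4, 5, 8), times = c(3, 3, 6, 2)),
  Date = as.Date(
    c("11/29/2018", "11/29/2018", "11/29/2018", "5/30/2018", "7/4/2018",
      "8/17/2018", "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019",
      "2/23/2019", "2/25/2019", "2/28/2019", "2/28/2019"),
    format = "%m/%d/%Y"
  ),
  Buyer = rep(c("Jenny", "Chang", "Chunfei"), times = c(3, 3, 8))
)

# Explicit dplyr:: prefixes rule out masking by another package (assumption:
# masking is the cause; the question does not say which packages were loaded)
df1 <- df %>%
  dplyr::group_by(Buyer, id) %>%
  dplyr::mutate(
    # Days since the previous row's date within the same Buyer/id group
    Diff = as.numeric(difftime(Date, dplyr::lag(Date), units = "days"))
  ) %>%
  dplyr::ungroup()

print(df1)
```

If this version yields `NA` at the start of every group while the unqualified version does not, a masked function is the likely explanation.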
We can use `diff` to subtract consecutive `Date` values and keep the groups that have at least one difference of 5 days or less.
library(dplyr)
df %>%
group_by(id, Buyer) %>%
filter(any(diff(Date) <= 5))
# id Date Buyer
# <dbl> <date> <chr>
# 1 9 2018-11-29 Jenny
# 2 9 2018-11-29 Jenny
# 3 9 2018-11-29 Jenny
# 4 5 2018-05-25 Chunfei
# 5 5 2019-02-13 Chunfei
# 6 5 2019-02-16 Chunfei
# 7 5 2019-02-16 Chunfei
# 8 5 2019-02-23 Chunfei
# 9 5 2019-02-25 Chunfei
#10 8 2019-02-28 Chunfei
#11 8 2019-02-28 Chunfei
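To see why this whole-group filter keeps the 2018-05-25 row for id 5, it can help to look at the per-group gaps directly. The following is a minimal sketch rather than part of the original answer; the names `gaps`, `min_gap` and `keep` are illustrative, and the data frame is rebuilt inline so the block runs on its own.

```r
library(dplyr)

# Sample data rebuilt inline so the block runs on its own
df <- data.frame(
  id = rep(c(9, 4, 5, 8), times = c(3, 3, 6, 2)),
  Date = as.Date(
    c("11/29/2018", "11/29/2018", "11/29/2018", "5/30/2018", "7/4/2018",
      "8/17/2018", "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019",
      "2/23/2019", "2/25/2019", "2/28/2019", "2/28/2019"),
    format = "%m/%d/%Y"
  ),
  Buyer = rep(c("Jenny", "Chang", "Chunfei"), times = c(3, 3, 8))
)

# One row per group: the smallest gap between consecutive dates, and whether
# the group satisfies the same condition used in filter()
gaps <- df %>%
  group_by(id, Buyer) %>%
  summarise(
    min_gap = min(as.numeric(diff(Date))),
    keep = any(diff(Date) <= 5),
    .groups = "drop"
  )

print(gaps)
```

Only the id 4 group fails the condition, since its smallest gap is 35 days. The id 5 group passes because some of its gaps are small, so every row in it is kept, including the one dated 264 days before the next purchase.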
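After re-reading the question, I think you may not want to `filter` whole groups, but only the rows that are within 5 days of each other. We can get the indices where the `diff` value is 5 or less and also select the index immediately before each of them.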
df %>%
group_by(id, Buyer) %>%
mutate(diff = c(NA, diff(Date))) %>%
slice({i1 <- which(diff <= 5); unique(c(i1, i1-1))}) %>%
select(-diff)
# id Date Buyer
# <dbl> <date> <chr>
# 1 5 2019-02-16 Chunfei
# 2 5 2019-02-16 Chunfei
# 3 5 2019-02-25 Chunfei
# 4 5 2019-02-13 Chunfei
# 5 5 2019-02-23 Chunfei
# 6 8 2019-02-28 Chunfei
# 7 8 2019-02-28 Chunfei
# 8 9 2018-11-29 Jenny
# 9 9 2018-11-29 Jenny
#10 9 2018-11-29 Jenny
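The `slice()` approach returns rows in the order of the computed indices rather than in their original order. If the original order matters, one alternative is to compare each row with its neighbors using `lag()` and `lead()`. This is a sketch under the assumption that rows are already sorted by date within each group, as in the sample data; `gap_before`, `gap_after` and `res` are names made up for illustration.

```r
library(dplyr)

# Sample data rebuilt inline so the block runs on its own
df <- data.frame(
  id = rep(c(9, 4, 5, 8), times = c(3, 3, 6, 2)),
  Date = as.Date(
    c("11/29/2018", "11/29/2018", "11/29/2018", "5/30/2018", "7/4/2018",
      "8/17/2018", "5/25/2018", "2/13/2019", "2/16/2019", "2/16/2019",
      "2/23/2019", "2/25/2019", "2/28/2019", "2/28/2019"),
    format = "%m/%d/%Y"
  ),
  Buyer = rep(c("Jenny", "Chang", "Chunfei"), times = c(3, 3, 8))
)

res <- df %>%
  group_by(id, Buyer) %>%
  mutate(
    gap_before = as.numeric(Date - dplyr::lag(Date)),   # days since previous row
    gap_after  = as.numeric(dplyr::lead(Date) - Date)   # days until next row
  ) %>%
  # coalesce() turns the NA at a group's first or last row into FALSE
  filter(coalesce(gap_before <= 5, FALSE) | coalesce(gap_after <= 5, FALSE)) %>%
  ungroup() %>%
  select(-gap_before, -gap_after)

print(res)
```

On the sample data this keeps the same ten rows as the `slice()` version, but in the order they appear in `df`.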
data
df <- structure(list(id = c(9, 9, 9, 4, 4, 4, 5, 5, 5, 5, 5, 5, 8,
8), Date = structure(c(17864, 17864, 17864, 17681, 17716, 17760,
17676, 17940, 17943, 17943, 17950, 17952, 17955, 17955), class = "Date"),
Buyer = c("Jenny", "Jenny", "Jenny", "Chang", "Chang", "Chang",
"Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei", "Chunfei",
"Chunfei", "Chunfei")), row.names = c(NA, -14L), class = "data.frame")