How to count unique IDs at different dates in R?
I am a beginner in R, so I apologise in advance if the question seems dumb, has an obvious solution, or has already been answered elsewhere.
I have a df of purchases, with the date and client ID of each purchase:
ANNEE Date clientID
1 2017 2017-01 aaa
2 2017 2017-01 bbb
3 2018 2018-01 aaa
4 2018 2018-02 aaa
5 2018 2018-01 bbb
6 2019 2019-01 aaa
7 2019 2019-01 ccc
8 2020 2020-01 ddd
9 2020 2020-01 ccc
I would like to know, for each year, what percentage of that year's clients were also present in my df the previous year. In this example, that would look like:
dfObjective
Date Prop
2017 0
2018 1
2019 0.5
2020 0.5
I thought the first step would be to rearrange my df to count the number of clients present in a given year, regardless of how many purchases they made, and I have done that (though I'm sure there is a better way):
library(plyr)
df2 = ddply(df, "ANNEE", summarise, Count = length(unique(clientID)))
df2
ANNEE Count
2017 2
2018 2
2019 2
2020 2
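For reference, the same per-year count of distinct clients can be sketched in base R without plyr. This is only an illustration: it assumes the example df shown above (rebuilt here by hand so the snippet runs on its own) and reuses the names df2 and Count from the question.

```r
# Example data from the question, rebuilt by hand
df <- data.frame(
  ANNEE = c(2017, 2017, 2018, 2018, 2018, 2019, 2019, 2020, 2020),
  Date = c("2017-01", "2017-01", "2018-01", "2018-02", "2018-01",
           "2019-01", "2019-01", "2020-01", "2020-01"),
  clientID = c("aaa", "bbb", "aaa", "aaa", "bbb", "aaa", "ccc", "ddd", "ccc")
)

# Count distinct clients per year, ignoring repeat purchases
df2 <- aggregate(clientID ~ ANNEE, data = df,
                 FUN = function(x) length(unique(x)))
names(df2)[2] <- "Count"
df2
```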
However, I can't find how to compute the proportion of clients who had already made at least one purchase the previous year...
Thank you in advance!
Here is a tidyverse solution.
First, group by clientID to determine which clients were present in the previous year. Then, group by year to find the proportions.
library(tidyverse)
df <- read_table2("
ANNEE Date clientID
2017 2017-01 aaa
2017 2017-01 bbb
2018 2018-01 aaa
2018 2018-02 aaa
2018 2018-01 bbb
2019 2019-01 aaa
2019 2019-01 ccc
2020 2020-01 ddd
2020 2020-01 ccc
")
df %>%
  distinct(clientID, ANNEE) %>%
  group_by(clientID) %>%
  mutate(in_previous_year = (ANNEE - 1) %in% ANNEE) %>%
  group_by(ANNEE) %>%
  summarise(Prop = sum(in_previous_year) / n())
#> # A tibble: 4 x 2
#> ANNEE Prop
#> <dbl> <dbl>
#> 1 2017 0
#> 2 2018 1
#> 3 2019 0.5
#> 4 2020 0.5
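To see why this works, it may help to stop the pipeline before the final summarise and look at the per-client flag. The sketch below assumes only dplyr and the example data (rebuilt by hand); the name flags is introduced here for illustration. Within each client's group, ANNEE holds every year that client appears in, so (ANNEE - 1) %in% ANNEE is TRUE exactly when the client also appears in the preceding calendar year.

```r
library(dplyr)

# Example data from the question, rebuilt by hand
df <- data.frame(
  ANNEE = c(2017, 2017, 2018, 2018, 2018, 2019, 2019, 2020, 2020),
  clientID = c("aaa", "bbb", "aaa", "aaa", "bbb", "aaa", "ccc", "ddd", "ccc")
)

# One row per (client, year), flagged TRUE if the same client
# also has a row for the preceding year
flags <- df %>%
  distinct(clientID, ANNEE) %>%
  group_by(clientID) %>%
  mutate(in_previous_year = (ANNEE - 1) %in% ANNEE) %>%
  ungroup() %>%
  arrange(ANNEE, clientID)
flags
```

Averaging in_previous_year within each year then gives the Prop column shown above.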
Base R:
data.frame(ANNEE = unique(df$ANNEE), prop =
rowMeans(apply(do.call(
rbind, lapply(with(df[order(df$ANNEE), ],
split(clientID, ANNEE)),
unique)
), 2, duplicated)))
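Note that the one-liner above appears to rely on every year having the same number of distinct clients (so that rbind produces a clean matrix), and duplicated looks at all earlier years in each column position rather than only the previous year. A more explicit base R sketch, which compares each year's client set directly with the preceding calendar year, could look like this. The names clients_by_year, years, prop and result are introduced here for illustration.

```r
# Example data from the question, rebuilt by hand
df <- data.frame(
  ANNEE = c(2017, 2017, 2018, 2018, 2018, 2019, 2019, 2020, 2020),
  clientID = c("aaa", "bbb", "aaa", "aaa", "bbb", "aaa", "ccc", "ddd", "ccc")
)

# Distinct clients per year, as a list named by year
clients_by_year <- lapply(split(df$clientID, df$ANNEE), unique)
years <- as.numeric(names(clients_by_year))

# Share of each year's clients that also appear in the preceding year
prop <- vapply(years, function(y) {
  current <- clients_by_year[[as.character(y)]]
  previous <- clients_by_year[[as.character(y - 1)]]  # NULL if there is no data for y - 1
  mean(current %in% previous)
}, numeric(1))

result <- data.frame(ANNEE = years, Prop = prop)
result
```

On the example data this gives 0, 1, 0.5 and 0.5 for 2017 to 2020, matching the tidyverse result. A year with no data for the preceding year gets a proportion of 0.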