
R - Subset by column status and count unique records against another dataframe

I've got a dataset that looks like this:

customer_id    group_a    group_b    group_c    group_d
123            true       false      true       false
456            false      true       false      true
789            false      true       true       false

I also have each customer's records in a second dataset like this:

customer_id    date
123            01/01/2019
123            01/02/2019
123            01/03/2019
123            01/04/2019
123            01/04/2019  

456            01/01/2019
456            01/02/2019
456            01/03/2019

789            01/01/2019
789            01/03/2019
789            01/03/2019
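
To make the example reproducible, the two tables above can be built as data frames along these lines. The names df1 and df2 are the ones the answer's code refers to; storing the flags as the character strings "true"/"false" and the dates as month/day/year strings is an assumption, since the question does not show the column types.

```r
# Membership flags per customer (first table).
# Assumption: flags are character strings, not logical values.
df1 <- data.frame(
  customer_id = c(123, 456, 789),
  group_a = c("true", "false", "false"),
  group_b = c("false", "true", "true"),
  group_c = c("true", "false", "true"),
  group_d = c("false", "true", "false"),
  stringsAsFactors = FALSE
)

# One row per customer record (second table), duplicates included.
# Assumption: dates are still unparsed month/day/year strings.
df2 <- data.frame(
  customer_id = c(rep(123, 5), rep(456, 3), rep(789, 3)),
  date = c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/04/2019",
           "01/01/2019", "01/02/2019", "01/03/2019",
           "01/01/2019", "01/03/2019", "01/03/2019"),
  stringsAsFactors = FALSE
)
```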

For every group, I'd like to get, by date, the number of unique customers that have a record among the customers flagged "true" for that group, along with the total number of customers in the group. The result should look like this:

date         group    record   total
01/01/2019   a        1        1
01/02/2019   a        1        1
01/03/2019   a        1        1
01/04/2019   a        1        1

01/01/2019   b        2        2
01/02/2019   b        1        2
01/03/2019   b        2        2
01/04/2019   b        0        2

01/01/2019   c        2        2
01/02/2019   c        1        2
01/03/2019   c        2        2
01/04/2019   c        1        2

01/01/2019   d        1        1
01/02/2019   d        1        1
01/03/2019   d        1        1
01/04/2019   d        0        1

I don't feel this is very elegant, but the result matches your expected output, so here it is.


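library(lubridate)  # mdy()
library(dplyr)
library(tidyr)

# Parse the month/day/year strings into Date values
df2$date <- mdy(df2$date)

# One row per (date, group, customer) where the customer belongs to the group;
# complete() adds the missing date/group pairs with customer_id = NA
df3 <- df2 %>%
  inner_join(df1, by = "customer_id") %>%
  gather(key = "group", value = "member", group_a:group_d) %>%
  # tolower() lets the test match character "true" as well as logical TRUE
  filter(tolower(member) == "true") %>%
  complete(date, group) %>%
  select(date, group, customer_id)

# Customers per group, counted among customers that have at least one record
totals <- df3 %>%
  group_by(group) %>%
  summarise(total = n_distinct(customer_id, na.rm = TRUE))

# Distinct customers per group and date, joined to the group totals
result <- df3 %>%
  group_by(group, date) %>%
  summarise(record = n_distinct(customer_id, na.rm = TRUE)) %>%
  ungroup() %>%
  left_join(totals, by = "group") %>%
  select(date, group, record, total)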

which gives:

# A tibble: 16 x 4
   date       group   record total
   <date>     <chr>    <int> <int>
 1 2019-01-01 group_a      1     1
 2 2019-01-02 group_a      1     1
 3 2019-01-03 group_a      1     1
 4 2019-01-04 group_a      1     1
 5 2019-01-01 group_b      2     2
 6 2019-01-02 group_b      1     2
 7 2019-01-03 group_b      2     2
 8 2019-01-04 group_b      0     2
 9 2019-01-01 group_c      2     2
10 2019-01-02 group_c      1     2
11 2019-01-03 group_c      2     2
12 2019-01-04 group_c      1     2
13 2019-01-01 group_d      1     1
14 2019-01-02 group_d      1     1
15 2019-01-03 group_d      1     1
16 2019-01-04 group_d      0     1
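
As a variation, here is a minimal sketch of the same computation using pivot_longer(), which supersedes gather() in current tidyr. It is not the answer's original method. The sample data, the character "true"/"false" flags, and the names totals and result2 are assumptions made for illustration. Two behaviours differ from the code above: names_prefix strips "group_" so the labels read a to d as in the question, and totals are counted from df1, so a customer with no records would still count toward the group total.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (assumed: character flags, m/d/Y dates)
df1 <- data.frame(
  customer_id = c(123, 456, 789),
  group_a = c("true", "false", "false"),
  group_b = c("false", "true", "true"),
  group_c = c("true", "false", "true"),
  group_d = c("false", "true", "false"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  customer_id = c(rep(123, 5), rep(456, 3), rep(789, 3)),
  date = as.Date(
    c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/04/2019",
      "01/01/2019", "01/02/2019", "01/03/2019",
      "01/01/2019", "01/03/2019", "01/03/2019"),
    format = "%m/%d/%Y"
  )
)

# Total customers per group, taken from the membership table itself
totals <- df1 %>%
  pivot_longer(starts_with("group_"), names_to = "group",
               names_prefix = "group_", values_to = "member") %>%
  group_by(group) %>%
  summarise(total = sum(member == "true"))

# Distinct customers per group and date; absent pairs are filled with 0
result2 <- df2 %>%
  distinct(customer_id, date) %>%              # drop duplicate records
  inner_join(df1, by = "customer_id") %>%
  pivot_longer(starts_with("group_"), names_to = "group",
               names_prefix = "group_", values_to = "member") %>%
  filter(member == "true") %>%
  count(group, date, name = "record") %>%
  complete(group, date = unique(df2$date), fill = list(record = 0L)) %>%
  left_join(totals, by = "group") %>%
  select(date, group, record, total)
```

Because duplicates are removed with distinct() before counting, a plain row count per group and date equals the number of distinct customers, so n_distinct() is not needed here.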
