简体   繁体   中英

How can we filter a join in R using DPLYR

I am working on a project to find the number of times a service has to be re-worked but only within the first 30 days of the end of the original service. Using the sample datasets below:

work <- tribble(
  ~Client ,  ~Initial_Date_In, ~Initial_Date_Out,
  'Car1',     '2019-01-01', '2019-01-02',
  'Car1',     '2019-06-01', '2019-06-03',
  'Car2',     '2019-01-01', '2019-01-02',
  'Truck1',     '2019-06-01', '2019-06-02',
  'Truck2',     '2019-04-01', '2019-04-02',
  'Truck3',     '2019-08-01', '2019-08-04',
  'Van1',     '2019-01-01', '2019-01-02',
  'Van2',     '2019-06-01', '2019-06-02',
  'Truck4',     '2019-10-05', '2019-10-09',
  
  
)

rework <- tribble(
  ~Client ,  ~Date_In, ~Date_Out,
  'Car1',     '2019-01-05', '2019-01-07',
  'Car1',     '2019-01-09', '2019-01-10',
  'Car1',     '2019-06-09', '2019-06-10',
  'Truck3',     '2019-08-07', '2019-08-08',
  'Van1',     '2019-03-01', '2019-03-01',
  'Van2',     '2019-06-09', '2019-06-11',
  'Truck4',     '2019-11-01', '2019-11-02',
  
  
)

work$Initial_Date_In <- lubridate::as_date(work$Initial_Date_In)
work$Initial_Date_Out <- lubridate::as_date(work$Initial_Date_Out)
rework$Date_In <- lubridate::as_date(rework$Date_In)
rework$Date_Out <- lubridate::as_date(rework$Date_Out)

I tried multiple things, such as:

work %>% 
  left_join(select(rework, Client,Date_Out), by = ("Client" = "Client")) %>% 
  group_by(Client) %>% 
  filter(max(Date_Out) >Initial_Date_Out & max(Date_Out) < (Initial_Date_Out+duration(30,"days")))

This code seems to pull the max Initial_Date_In(Out), but I want all from the work table where there is an appropriate entry in rework (displaying on the max Date_Out from the rework table)

The expected output would be something like:

Client    Initial_Date_In    Initial_Date_Out    Date_Out    #_Of_Rework_Visits
Car1      2019-01-01         2019-01-02          2019-01-10    2
Car1      2019-06-01         2019-06-03          2019-06-10    1
Truck3    2019-08-01         2019-08-04          2019-08-08    1
Van1      2019-06-01         2019-06-02          2019-06-11    1
Truck4    2019-10-05         2019-10-09          2019-11-02    1

Any tips on what I am doing wrong?

Update
Here's an approach than avoids doing any kind of formal join.
Stack work and rework with bind_rows , then use a service_ix column (index for service appts within a given Client ) as a grouping/summarizing variable:

work %>% 
  mutate(service = "original") %>% 
  rename(Date_In = "Initial_Date_In", Date_Out = "Initial_Date_Out") %>% 
  group_by(Client) %>% 
  mutate(service_ix = row_number()) %>% 
  bind_rows(rework %>% 
              mutate(service = "rework") %>% 
              select(Client, Date_In, Date_Out, service)) %>% 
  arrange(Client, Date_In) %>% 
  fill(service_ix) %>% 
  group_by(Client, service_ix) %>% 
  summarise(initial_date_in = Date_In[service == "original"],
            initial_date_out = Date_Out[service == "original"],
            max_date_out = max(Date_Out),
            n_reworks = sum(service == "rework")) %>%
  filter(n_reworks > 0) %>% 
  ungroup()

Unpacking a bit:

  1. Added a service column indicating initial vs rework. Here's the combined df after bind_rows() :
# A tibble: 16 x 5
# Groups:   Client [8]
   Client Date_In    Date_Out   service  service_ix
   <chr>  <date>     <date>     <chr>         <int>
 1 Car1   2019-01-01 2019-01-02 original          1
 2 Car1   2019-01-05 2019-01-07 rework           NA
 3 Car1   2019-01-09 2019-01-10 rework           NA
 4 Car1   2019-06-01 2019-06-03 original          2
 5 Car1   2019-06-09 2019-06-10 rework           NA
 6 Car2   2019-01-01 2019-01-02 original          1
 7 Truck1 2019-06-01 2019-06-02 original          1
 8 Truck2 2019-04-01 2019-04-02 original          1
 9 Truck3 2019-08-01 2019-08-04 original          1
10 Truck3 2019-08-07 2019-08-08 rework           NA
11 Truck4 2019-10-05 2019-10-09 original          1
12 Truck4 2019-11-01 2019-11-02 rework           NA
13 Van1   2019-01-01 2019-01-02 original          1
14 Van1   2019-03-01 2019-03-01 rework           NA
15 Van2   2019-06-01 2019-06-02 original          1
16 Van2   2019-06-09 2019-06-11 rework           NA
  1. The fill step replaces the service_ix NA values:
   Client Date_In    Date_Out   service  service_ix
   <chr>  <date>     <date>     <chr>         <int>
 1 Car1   2019-01-01 2019-01-02 original          1
 2 Car1   2019-01-05 2019-01-07 rework            1
 3 Car1   2019-01-09 2019-01-10 rework            1
 4 Car1   2019-06-01 2019-06-03 original          2
 5 Car1   2019-06-09 2019-06-10 rework            2
  1. Now group by Client and service_ix , and summarize as needed.
# A tibble: 6 x 6
  Client service_ix initial_date_in initial_date_out max_date_out n_reworks
  <chr>       <int> <date>          <date>           <date>           <int>
1 Car1            1 2019-01-01      2019-01-02       2019-01-10           2
2 Car1            2 2019-06-01      2019-06-03       2019-06-10           1
3 Truck3          1 2019-08-01      2019-08-04       2019-08-08           1
4 Truck4          1 2019-10-05      2019-10-09       2019-11-02           1
5 Van1            1 2019-01-01      2019-01-02       2019-03-01           1
6 Van2            1 2019-06-01      2019-06-02       2019-06-11           1

Here is something else to try, using the fuzzyjoin package.

First, calculate a Date_End_Service date which is 30 days after the Initial_Date_Out . Then, you can do a fuzzy_inner_join between work and rework , matching on Client , and making sure the Initial_Date_Out is <= Date_Out in rework , and the Date_End_Service is past Date_Out .

library(tidyverse)
library(lubridate)
library(fuzzyjoin)

work %>%
  mutate(Date_End_Service = Initial_Date_Out + days(30)) %>%
  fuzzy_inner_join(rework, 
                   by = c("Client" = "Client", "Initial_Date_Out" = "Date_Out", "Date_End_Service" = "Date_Out"), 
                   match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(Client.x, Initial_Date_In) %>%
  summarise(n_service = n())

Output

  Client.x Initial_Date_In n_service
  <chr>    <date>              <int>
1 Car1     2019-01-01              2
2 Car1     2019-06-01              1
3 Truck3   2019-08-01              1
4 Truck4   2019-10-05              1
5 Van2     2019-06-01              1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM