
R code for a left join on dates, where some dates match exactly and others do not

dfy<-tibble(ttc= c("830592962A","701134213K","620001491E","500542890M","400259766M","800136692H","701229741E"),
            CaseDate1=c("01/04/2019","01/04/2019","02/04/2019","02/04/2019","02/04/2019","02/04/2019","03/04/2019"),
            Theatre=c("RIE_TH_06","RIE_TH_06","RIE_TH_08","RIE_TH_08","RIE_TH_06","RIE_TH_06","RIE_TH_08"))

dss<-tibble(ttc=c("400259766M","800136692H","701229741E","830592962A","701134213K","620001491E","500542890M"),
            D1=c("NA","01/04/2019","NA","01/04/2019","01/04/2019","02/04/2019","NA"),
            D2=c("02/04/2019","NA","NA","NA","NA","NA","02/04/2019"),
            D3=c("NA","NA","04/04/2019","NA","NA","NA","NA"),
            C5=c("APPLE","ORANGE","PINE","MANGO","CHERRY","SUGAR","GREEN"))
  1. First, I want to left join the two tables on exact matches between dfy(ttc & CaseDate1) and dss(ttc & coalesce(D1, D2, D3)).

  2. Second, where there is no exact match, I want to match on a date one day before or one day after in dss(ttc & coalesce(D1, D2, D3)).

  3. For example, dfy(701229741E & 03/04/2019) should match dss(701229741E & 04/04/2019), which is one day after.

I have used the following code, but it only joins the rows where ttc and the date match exactly:

dfy %>% 
  left_join(dss %>% mutate(x = coalesce(D1, D2, D3)), by = c("ttc", "CaseDate1" = "x")) %>% 
  select(ttc, CaseDate1, Theatre, C5)

coalesce is not working as intended because "NA" in the data is a string, not a missing value. I fixed that with:

# Replace the literal string "NA" with a real missing value in each date column
for (col in c("D1", "D2", "D3")) {
  dss[[col]][dss[[col]] == "NA"] <- NA
}
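The same clean-up can also be written inside a dplyr pipeline. This is only a sketch, assuming dplyr 1.0 or later (for across) and a shortened copy of dss; dss_clean is a name made up for the example.

```r
library(dplyr)
library(tibble)

# Shortened copy of dss, for illustration only
dss <- tibble(
  ttc = c("400259766M", "800136692H", "701229741E"),
  D1  = c("NA", "01/04/2019", "NA"),
  D2  = c("02/04/2019", "NA", "NA"),
  D3  = c("NA", "NA", "04/04/2019")
)

dss_clean <- dss %>%
  # na_if() turns the literal string "NA" into a real missing value
  mutate(across(c(D1, D2, D3), ~ na_if(.x, "NA"))) %>%
  # coalesce() can now pick the first non-missing date in each row
  mutate(x = coalesce(D1, D2, D3))

dss_clean$x
```

With real NA values in place, x holds the single date present in each row.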

With that fix, your original code returns:

# A tibble: 7 x 4
  ttc        CaseDate1  Theatre   C5    
  <chr>      <chr>      <chr>     <chr> 
1 830592962A 01/04/2019 RIE_TH_06 MANGO 
2 701134213K 01/04/2019 RIE_TH_06 CHERRY
3 620001491E 02/04/2019 RIE_TH_08 SUGAR 
4 500542890M 02/04/2019 RIE_TH_08 GREEN 
5 400259766M 02/04/2019 RIE_TH_06 APPLE 
6 800136692H 02/04/2019 RIE_TH_06 NA    
7 701229741E 03/04/2019 RIE_TH_08 NA   

For the rows that still have no match, my suggestion is to use full_join instead of left_join, and then use the fill function (from tidyr) on the data frame grouped by ttc:

dfy %>% 
  full_join(dss %>% mutate(x = coalesce(D1, D2, D3)), by = c("ttc", "CaseDate1" = "x")) %>% 
  select(ttc, CaseDate1, Theatre, C5) %>%
  group_by(ttc) %>%
  arrange(desc(CaseDate1)) %>%
  fill(C5) %>%
  filter(!is.na(Theatre)) %>%
  ungroup() %>%
  arrange(CaseDate1)

This outputs:

# A tibble: 7 x 4
  ttc        CaseDate1  Theatre   C5    
  <chr>      <chr>      <chr>     <chr> 
1 830592962A 01/04/2019 RIE_TH_06 MANGO 
2 701134213K 01/04/2019 RIE_TH_06 CHERRY
3 620001491E 02/04/2019 RIE_TH_08 SUGAR 
4 500542890M 02/04/2019 RIE_TH_08 GREEN 
5 400259766M 02/04/2019 RIE_TH_06 APPLE 
6 800136692H 02/04/2019 RIE_TH_06 NA    
7 701229741E 03/04/2019 RIE_TH_08 PINE  

Here filter(!is.na(Theatre)) drops the rows that did not come from dfy (the "left" data frame).
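One caveat about arrange(desc(CaseDate1)): CaseDate1 is a character column in dd/mm/yyyy form, so it is sorted as text. That happens to work for this sample because every date falls in the same month. The small sketch below uses made-up dates to show the difference once the dates span more than one month.

```r
# Made-up dates spanning two months
dates <- c("30/04/2019", "01/05/2019", "15/04/2019")

# Character sort compares the day digits first, so 1 May comes before 15 April
text_order <- sort(dates)

# Parsing to Date first gives true chronological order
parsed <- as.Date(dates, format = "%d/%m/%Y")
date_order <- dates[order(parsed)]

text_order
date_order
```

Converting the date columns with as.Date before arranging would avoid relying on that coincidence.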

If you want to fill in both directions, you can add the .direction argument to the fill function.

dfy %>% 
  full_join(dss %>% mutate(x = coalesce(D1, D2, D3)), by = c("ttc", "CaseDate1" = "x")) %>% 
  select(ttc, CaseDate1, Theatre, C5) %>%
  group_by(ttc) %>%
  arrange(desc(CaseDate1)) %>%
  fill(C5, .direction='updown') %>%
  filter(!is.na(Theatre)) %>%
  ungroup() %>%
  arrange(CaseDate1)

This outputs:

# A tibble: 7 x 4
  ttc        CaseDate1  Theatre   C5    
  <chr>      <chr>      <chr>     <chr> 
1 830592962A 01/04/2019 RIE_TH_06 MANGO 
2 701134213K 01/04/2019 RIE_TH_06 CHERRY
3 620001491E 02/04/2019 RIE_TH_08 SUGAR 
4 500542890M 02/04/2019 RIE_TH_08 GREEN 
5 400259766M 02/04/2019 RIE_TH_06 APPLE 
6 800136692H 02/04/2019 RIE_TH_06 ORANGE
7 701229741E 03/04/2019 RIE_TH_08 PINE  
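Note that fill copies C5 from any other row with the same ttc, however far apart the dates are. If the one-day window from the question has to be enforced, one possible sketch is to join on ttc only and then keep C5 only when the dates differ by at most one day. It assumes at most one dss row per ttc, as in the sample data; to_date, dss_dated, case_date, dss_date and day_gap are names made up for the example.

```r
library(dplyr)
library(tibble)

dfy <- tibble(
  ttc = c("830592962A", "701134213K", "620001491E", "500542890M",
          "400259766M", "800136692H", "701229741E"),
  CaseDate1 = c("01/04/2019", "01/04/2019", "02/04/2019", "02/04/2019",
                "02/04/2019", "02/04/2019", "03/04/2019"),
  Theatre = c("RIE_TH_06", "RIE_TH_06", "RIE_TH_08", "RIE_TH_08",
              "RIE_TH_06", "RIE_TH_06", "RIE_TH_08")
)

dss <- tibble(
  ttc = c("400259766M", "800136692H", "701229741E", "830592962A",
          "701134213K", "620001491E", "500542890M"),
  D1 = c("NA", "01/04/2019", "NA", "01/04/2019", "01/04/2019", "02/04/2019", "NA"),
  D2 = c("02/04/2019", "NA", "NA", "NA", "NA", "NA", "02/04/2019"),
  D3 = c("NA", "NA", "04/04/2019", "NA", "NA", "NA", "NA"),
  C5 = c("APPLE", "ORANGE", "PINE", "MANGO", "CHERRY", "SUGAR", "GREEN")
)

# Hypothetical helper: parse dd/mm/yyyy strings into Date values
to_date <- function(x) as.Date(x, format = "%d/%m/%Y")

# One real date per dss row (assumes a single row per ttc)
dss_dated <- dss %>%
  mutate(across(c(D1, D2, D3), ~ na_if(.x, "NA")),
         dss_date = to_date(coalesce(D1, D2, D3))) %>%
  select(ttc, dss_date, C5)

result <- dfy %>%
  mutate(case_date = to_date(CaseDate1)) %>%
  left_join(dss_dated, by = "ttc") %>%
  mutate(
    # Absolute gap in whole days between the two dates
    day_gap = abs(as.integer(difftime(dss_date, case_date, units = "days"))),
    # Keep C5 only for exact matches (gap 0) or a one-day difference (gap 1)
    C5 = if_else(!is.na(day_gap) & day_gap <= 1, C5, NA_character_)
  ) %>%
  select(ttc, CaseDate1, Theatre, C5, day_gap)

result
```

On the sample data this gives the same C5 values as the fill version with .direction = 'updown', but a dss date more than one day away would leave C5 as NA instead of being filled in.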

It is not clear to me that this is your intended output, but I hope it points you in the right direction.
