
Inner join on two data frames based on an exact match for one column and a fuzzy match for two columns

I'd like to perform an exact match on one of my columns (Product_date), followed by a partial or fuzzy match on Product_name and Product_state.

For example:

df1 <- data.frame(ID=c("P01", "P04", "P23"),
                  Product_name=c("Jewel", "Bronze", "Iron"), 
                  Product_state=c("Kansas", "Illinois", "Florida"),
                  Product_date=c("2021-08-01", "2021-01-01", "2020-12-21"))

df2 <- data.frame(
  Product_name=c("Jewel", "Bro", "Ir", "Uknw"), 
  Product_state=c("Kansasss", "IllI", "Flor_ida", "Cali2"),
  Product_date=c("2021-08-01", "2021-01-01", "2020-12-21", "2020-09"),
  Product_status=c("sold", "lost", "sold", "sold"))

desired_df <-  data.frame(ID=c("P01", "P04", "P23"),
                          Product_name=c("Jewel", "Bronze", "Iron"), 
                          Product_state=c("Kansas", "Illinois", "Florida"),
                          Product_date=c("2021-08-01", "2021-01-01", "2020-12-21"), 
                          Product_name=c("Jewel", "Bro", "Ir"), 
                          Product_state=c("Kansasss", "IllI", "Flor_ida"),
                          Product_date=c("2021-08-01", "2021-01-01", "2020-12-21"), 
                          Product_status=c("sold", "lost", "sold"))

Just for illustrative purposes, this is what the code in my head looks like (but of course it doesn't work):

matched <- df1 %>%
stringdist_inner_join(df2, by= c("Product_name", max_dist=2),
                           by= c("Product_stat", max_dist=4), 
                           by = c("Product_date"))

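A possible solution is to fuzzy-join on Product_name and Product_state, then keep only the rows whose dates match exactly (the filter also drops the unmatched rows of the left join, so the result behaves like an inner join):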

library(fuzzyjoin)
library(dplyr)

stringdist_join(df1, df2, 
                by = c("Product_name","Product_state"),
                mode = "left",
                ignore_case = FALSE, 
                method = "jw", 
                max_dist = 0.5) %>% 
  filter(Product_date.x == Product_date.y)
#>    ID Product_name.x Product_state.x Product_date.x Product_name.y
#> 1 P01          Jewel          Kansas     2021-08-01          Jewel
#> 2 P04         Bronze        Illinois     2021-01-01            Bro
#> 3 P23           Iron         Florida     2020-12-21             Ir
#>   Product_state.y Product_date.y Product_status
#> 1        Kansasss     2021-08-01           sold
#> 2            IllI     2021-01-01           lost
#> 3        Flor_ida     2020-12-21           sold
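
If you want a separate threshold for each column and the date compared exactly inside the join itself, one option is fuzzy_inner_join() with one match function per column. This is only a sketch: the helper jw_within() and the cutoffs 0.25 and 0.35 are my own illustrative choices, not something from the question, and they would need tuning on real data.

```r
library(fuzzyjoin)
library(stringdist)

df1 <- data.frame(ID = c("P01", "P04", "P23"),
                  Product_name = c("Jewel", "Bronze", "Iron"),
                  Product_state = c("Kansas", "Illinois", "Florida"),
                  Product_date = c("2021-08-01", "2021-01-01", "2020-12-21"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(Product_name = c("Jewel", "Bro", "Ir", "Uknw"),
                  Product_state = c("Kansasss", "IllI", "Flor_ida", "Cali2"),
                  Product_date = c("2021-08-01", "2021-01-01", "2020-12-21", "2020-09"),
                  Product_status = c("sold", "lost", "sold", "sold"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: builds a vectorized match function that returns TRUE
# where the Jaro-Winkler distance between a and b is at most `cutoff`.
jw_within <- function(cutoff) {
  function(a, b) stringdist(a, b, method = "jw") <= cutoff
}

# One match function per column in `by`, in the same order.
# A pair of rows is kept only if all three functions return TRUE.
matched <- fuzzy_inner_join(
  df1, df2,
  by = c("Product_name", "Product_state", "Product_date"),
  match_fun = list(jw_within(0.25),  # assumed cutoff for Product_name
                   jw_within(0.35),  # assumed cutoff for Product_state
                   `==`)             # exact match on Product_date
)

matched
```

As in the stringdist_join() version, the columns that share a name come back with .x and .y suffixes.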

