
Select rows from a different table where a string from the first table's column is present (R)

I am trying to match two tables when a string from one table is fully present in a column of the other. So far I have only managed a partial join, after which I apply Levenshtein distance to find close matches. This approach has limited use and accuracy. My approach:

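library(dplyr)
library(fuzzyjoin)
library(stringr)
checkg <- check %>% 
  fuzzy_inner_join(LOCATIONS, by = c("STRING" = "STRING"), match_fun = str_detect) %>%
  rowwise() %>%
  # both tables share the column name STRING, so the join returns STRING.x and STRING.y
  mutate(DIST = as.vector(adist(x = STRING.x, y = STRING.y, ignore.case = TRUE)))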

Is there any way to map the tables as shown below? The STATUS column in the output table is only there to make clear that partial string matching is not the objective. It is not required in the output. Thanks.

TABLE 1

**STRING**
BATANGAS
QINGDAO

TABLE 2

**STRING**
BATANGAS LUZON
QINGDAO PT
TANGA

OUTPUT TABLE checkg

TABLE1.STRING   TABLE2.STRING    STATUS
BATANGAS        BATANGAS LUZON   Accept
QINGDAO         QINGDAO PT       Accept
BATANGAS        TANGA            Reject

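You can reverse the order of the tables in the join, so that each LOCATIONS string is searched for a string from check. This avoids partial matches in which a LOCATIONS string is only a fragment of a check string.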

library(dplyr)      # %>%, mutate(), select()
library(stringr)    # str_detect()
library(fuzzyjoin)

check <- data.frame(STRING = c("BATANGAS", "QINGDAO"))
LOCATIONS <- data.frame(STRING = c("BATANGAS LUZON", "QINGDAO PT", "TANGA"))

LOCATIONS %>% 
  fuzzy_right_join(check, by = c("STRING" = "STRING"), match_fun = str_detect)

        STRING.x STRING.y
1 BATANGAS LUZON BATANGAS
2     QINGDAO PT  QINGDAO
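
To see why the direction matters, here is a minimal sketch in base R. It uses grepl() with fixed = TRUE as a stand-in for str_detect(); the variable names short and long are just for illustration and are not part of the answer's code.

```r
# Base R sketch; grepl(pattern, x, fixed = TRUE) asks "does x contain pattern?"
short <- c("BATANGAS", "QINGDAO")                     # values from check
long  <- c("BATANGAS LUZON", "QINGDAO PT", "TANGA")   # values from LOCATIONS

# Direction used in the join above: the check string is the pattern
short_in_long <- sapply(short, function(s) grepl(s, long, fixed = TRUE))
rownames(short_in_long) <- long

# Opposite direction: the LOCATIONS string is the pattern
long_in_short <- sapply(long, function(l) grepl(l, short, fixed = TRUE))
rownames(long_in_short) <- short

short_in_long["TANGA", "BATANGAS"]   # FALSE: "TANGA" does not contain "BATANGAS"
long_in_short["BATANGAS", "TANGA"]   # TRUE: the unwanted partial match
```

With the check string as the pattern, TANGA drops out because it is shorter than anything it could contain.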

To go further and match whole words only, wrap each search string in word boundaries (\\b). Here LOCATIONS has an extra row, ABCD, to show the difference:

check <- structure(list(To_check = c("BATANGAS", "QINGDAO", "ABC", "DEF"
), id = 1:4), class = "data.frame", row.names = c(NA, -4L))

> check
  To_check id
1 BATANGAS  1
2  QINGDAO  2
3      ABC  3
4      DEF  4

> LOCATIONS
          STRING
1 BATANGAS LUZON
2     QINGDAO PT
3          TANGA
4           ABCD

LOCATIONS %>% 
  fuzzy_right_join(check %>% mutate(dummy = paste0('\\b', To_check, '\\b')), 
                   by = c("STRING" = "dummy"), match_fun = str_detect) %>%
  select(-dummy)

          STRING To_check id
1 BATANGAS LUZON BATANGAS  1
2     QINGDAO PT  QINGDAO  2
3           <NA>      ABC  3
4           <NA>      DEF  4

Needless to say, you can use fuzzy_inner_join instead if you want only the matched rows.
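
As a quick check of what the word boundaries do, here is a small base R sketch. The helper name whole_word is hypothetical, and it assumes the search strings contain no regex metacharacters (such as . or parentheses), which would otherwise need escaping first.

```r
# Hypothetical helper: TRUE when `word` appears in `text` as a whole word.
# Assumes `word` contains no regex metacharacters.
whole_word <- function(word, text) {
  grepl(paste0("\\b", word, "\\b"), text, perl = TRUE)
}

whole_word("ABC", "ABCD")                  # FALSE: ABC is only part of a word
whole_word("BATANGAS", "BATANGAS LUZON")   # TRUE
whole_word("QINGDAO", "QINGDAO PT")        # TRUE

# Without the boundaries, ABC would match ABCD
grepl("ABC", "ABCD", fixed = TRUE)         # TRUE
```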

It depends on the nature of your tables but in general this is the solution I came up with:

library(data.table)
library(fuzzyjoin)   # stringdist_join()

Table1 <- data.table(STRING = c("BATANGAS", "QINGDAO"))
Table2 <- data.table(STRING = c("BATANGAS LUZON", "QINGDAO PT", "TANGA"))

Table3 <- as.data.table(stringdist_join(Table1, Table2, by = "STRING", max_dist = 6, method = "lv", 
                                        mode = "full", distance_col = "STATUS"))

I am not familiar enough with dplyr to replicate it there, so I am using data.table in my example.

This code will produce the following result:

STRING.x    STRING.y          STATUS
BATANGAS    BATANGAS LUZON    6
BATANGAS    TANGA             3
QINGDAO     QINGDAO PT        3
QINGDAO     TANGA             4
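
The STATUS values are Levenshtein distances, which can be reproduced with base R's adist(). The sketch below is only a cross-check of the numbers in the table, not part of the join itself; the names x, y and d are illustrative.

```r
# Pairwise Levenshtein distances between the two STRING columns
x <- c("BATANGAS", "QINGDAO")
y <- c("BATANGAS LUZON", "QINGDAO PT", "TANGA")

d <- adist(x, y)            # rows follow x, columns follow y
dimnames(d) <- list(x, y)
d

# Pairs that survive the max_dist = 6 cutoff
sum(d <= 6)

# The unwanted pair is closer (3) than the wanted one (6)
d["BATANGAS", "TANGA"] < d["BATANGAS", "BATANGAS LUZON"]
```

Because BATANGAS is closer to TANGA than to BATANGAS LUZON, no distance threshold alone can keep the wanted pair and drop the unwanted one.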

Now it gets a bit tricky. I imagine you don't want TANGA to match two different values in STRING.x. In this example, however, you do want BATANGAS to match two different values in STRING.y. If you always want to remove duplicates from STRING.y, keeping only the closest match for each value, you can use this:

Table3 <- Table3[ , .SD[which.min(STATUS)], by = STRING.y]

which will produce:

STRING.y          STRING.x    STATUS
BATANGAS LUZON    BATANGAS    6
TANGA             BATANGAS    3
QINGDAO PT        QINGDAO     3
