I am trying to match two tables when a string from one table is fully present in a column of the other table. So far I have only managed a partial join, after which I apply Levenshtein distance to find close matches. This approach has limited use and accuracy. My current approach:
checkg <- check %>%
fuzzy_inner_join(LOCATIONS, by = c("STRING" = "STRING"), match_fun = str_detect) %>%
rowwise() %>%
mutate(DIST = adist(x=STRING, y=LOCATION, ignore.case = TRUE))
Is there any way to map it in the following way? The STATUS column in the output table is only there to make clear that partial string matching is not the objective; it is not required in the output. Thanks.
TABLE 1
**STRING**
BATANGAS
QINGDAO
TABLE2
**STRING**
BATNAGAS LUZON
QINGDAO PT
OUTPUT TABLE checkg
TABLE1.STRING   TABLE2.STRING     STATUS
BATANGAS        BATNAGAS LUZON    Accept
QINGDAO         QINGDAO PT        Accept
BATANGAS        TANGA             Reject
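You can reverse the join so that LOCATIONS is the left table and check is the right table. str_detect then treats each check string as the pattern and looks for it inside the LOCATIONS strings, so a partial match such as TANGA inside BATANGAS is not returned.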
library(dplyr)      # for %>%
library(stringr)    # for str_detect
library(fuzzyjoin)
check <- data.frame(STRING = c("BATANGAS", "QINGDAO"))
LOCATIONS <- data.frame(STRING = c("BATANGAS LUZON", "QINGDAO PT", "TANGA"))
LOCATIONS %>%
fuzzy_right_join(check, by = c("STRING" = "STRING"), match_fun = str_detect)
STRING.x STRING.y
1 BATANGAS LUZON BATANGAS
2 QINGDAO PT QINGDAO
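Why the direction matters can be seen without any join at all. The following is a minimal base-R sketch, not part of the answer above: it uses grepl(pattern, x) in place of str_detect, and the names check_strings, locations, loc_contains_check and check_contains_loc are made up for illustration.

```r
# grepl(pattern, x) asks: does x contain pattern?
check_strings <- c("BATANGAS", "QINGDAO")
locations <- c("BATANGAS LUZON", "QINGDAO PT", "TANGA")

# Direction used in the answer: each check string is the pattern,
# searched for inside every location (fixed = TRUE means literal text).
loc_contains_check <- sapply(check_strings,
                             function(p) grepl(p, locations, fixed = TRUE))
rownames(loc_contains_check) <- locations

# Opposite direction: each location is the pattern,
# searched for inside every check string.
check_contains_loc <- sapply(locations,
                             function(p) grepl(p, check_strings, fixed = TRUE))
rownames(check_contains_loc) <- check_strings

print(loc_contains_check)
print(check_contains_loc)
```

In the first matrix the TANGA row is all FALSE, because TANGA does not contain either check string. In the second matrix the BATANGAS/TANGA cell is TRUE, which is the partial match the question wants to reject.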
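To go further and match full words only, you can wrap each search string in \\b word-boundary anchors. In the example below LOCATIONS has an extra row, ABCD, to show that ABC does not match it: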
check <- structure(list(To_check = c("BATANGAS", "QINGDAO", "ABC", "DEF"
), id = 1:4), class = "data.frame", row.names = c(NA, -4L))
check
> check
To_check id
1 BATANGAS 1
2 QINGDAO 2
3 ABC 3
4 DEF 4
> LOCATIONS
STRING
1 BATANGAS LUZON
2 QINGDAO PT
3 TANGA
4 ABCD
LOCATIONS %>%
fuzzy_right_join(check %>% mutate(dummy = paste0('\\b', To_check, '\\b')),
by = c("STRING" = "dummy"), match_fun = str_detect) %>%
select(-dummy)
STRING To_check id
1 BATANGAS LUZON BATANGAS 1
2 QINGDAO PT QINGDAO 2
3 <NA> ABC 3
4 <NA> DEF 4
Needless to say, you can use fuzzy_inner_join instead to keep only the matched rows.
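The word-boundary idea can also be tried in isolation. This is a small base-R sketch under the assumption that the search strings contain no regex metacharacters (a string with characters such as `.` or `(` would need escaping first); the variable names are illustrative only.

```r
check_strings <- c("BATANGAS", "QINGDAO", "ABC", "DEF")
locations <- c("BATANGAS LUZON", "QINGDAO PT", "TANGA", "ABCD")

# Wrap each search string in \b anchors so it only matches as a whole word.
patterns <- paste0("\\b", check_strings, "\\b")

# For each pattern, collect the locations that contain it as a whole word.
matches <- lapply(patterns,
                  function(p) locations[grepl(p, locations, perl = TRUE)])
names(matches) <- check_strings

print(matches)
```

Here ABC finds nothing: it is a substring of ABCD, but there is no word boundary after the C.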
It depends on the nature of your tables but in general this is the solution I came up with:
library(data.table)
library(fuzzyjoin)
Table1 <- data.table(STRING = c("BATANGAS", "QINGDAO"))
Table2 <- data.table(STRING = c("BATANGAS LUZON", "QINGDAO PT", "TANGA"))
# Levenshtein ("lv") join keeping pairs within distance 6; the distance is stored in STATUS
Table3 <- as.data.table(stringdist_join(Table1, Table2, by = "STRING", max_dist = 6, method = "lv",
                                        mode = "full", distance_col = "STATUS"))
I am not familiar enough with dplyr to replicate it there so I am using data.table in my example.
This code will produce the following result:
STRING.x    STRING.y          STATUS
BATANGAS    BATANGAS LUZON    6
BATANGAS    TANGA             3
QINGDAO     QINGDAO PT        3
QINGDAO     TANGA             4
Now it gets a bit tricky. I can imagine that you don't want TANGA to match two different values in STRING.x. However, in this example you do want BATANGAS to match two different values in STRING.y. If you always want to remove duplicates from STRING.y, keeping only the closest match for each value, you can do so like this:
Table3 <- Table3[ , .SD[which.min(STATUS)], by = STRING.y]
which will produce:
STRING.y          STRING.x    STATUS
BATANGAS LUZON    BATANGAS    6
TANGA             BATANGAS    3
QINGDAO PT        QINGDAO     3
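The distances above can be reproduced with base R's adist, which also makes it easy to put a containment flag next to each distance. This is only a sketch for inspecting the pairs, not a replacement for the join: the contained column and the variable names are assumptions added here, and the comparison is case-sensitive.

```r
check_strings <- c("BATANGAS", "QINGDAO")
locations <- c("BATANGAS LUZON", "QINGDAO PT", "TANGA")

# Levenshtein distance for every (check string, location) pair.
dist_matrix <- utils::adist(check_strings, locations)
dimnames(dist_matrix) <- list(check_strings, locations)

# Does the location contain the check string as literal text?
contains <- t(sapply(check_strings,
                     function(p) grepl(p, locations, fixed = TRUE)))
colnames(contains) <- locations

# One row per pair; expand.grid and as.vector both vary the check string fastest.
pairs <- expand.grid(check = check_strings, location = locations,
                     stringsAsFactors = FALSE)
pairs$distance <- as.vector(dist_matrix)
pairs$contained <- as.vector(contains)

print(pairs)
```

BATANGAS is closer to TANGA (distance 3) than to BATANGAS LUZON (distance 6), so a distance threshold alone cannot separate the pair the question accepts from the one it rejects. The contained flag does separate them in this small example.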