I am working on an R task that involves two separate data frames. I need to merge them by one column (containing geographical names) whose values sometimes differ slightly, for example:
"A Coruna" and "Coruna, A", "Alicante/Alacant" and "Alicante", "Santa Cruz de Tenerife" and "4 Santa Cruz".
Pairs like those should be treated as the same value when merging the data frames, so the result of the merge would be a data frame like:
province | males.2018 | males.2013 | area
Is there a way to do this without using extra libraries?
Thank you
I think the easiest way is to fix the province names in both data frames: add a new column with the ISO 3166-2:ES codes and merge on that instead. If you paste the output of dput() for your data into the question, I can provide code to do that.
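Until the real data is posted, here is a minimal base R sketch of that idea. It assumes a hand-built lookup table (iso_lookup is a hypothetical name) that maps every spelling variant to its ISO 3166-2:ES code; the numbers in males.2018 and area are placeholders, not real data.

```r
# Two data frames whose province names are spelled differently.
# The numeric values are placeholders for illustration only.
df1 <- data.frame(
  province   = c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife"),
  males.2018 = c(10, 20, 30),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  province = c("4 Santa Cruz", "Coruna, A", "Alicante"),
  area     = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Hand-maintained lookup: every spelling variant -> ISO 3166-2:ES code.
iso_lookup <- data.frame(
  name = c("A Coruna", "Coruna, A",
           "Alicante/Alacant", "Alicante",
           "Santa Cruz de Tenerife", "4 Santa Cruz"),
  code = c("ES-C", "ES-C", "ES-A", "ES-A", "ES-TF", "ES-TF"),
  stringsAsFactors = FALSE
)

# Add the code column to both data frames, then merge on it.
df1$code <- iso_lookup$code[match(df1$province, iso_lookup$name)]
df2$code <- iso_lookup$code[match(df2$province, iso_lookup$name)]
merged <- merge(df1, df2, by = "code")
```

Any name missing from the lookup gets NA as its code, so checking anyNA(df1$code) before merging shows which spellings still need to be added.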
I'm not sure how to do this without any external packages; perhaps using agrep?
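One base R option in the same family as agrep is adist() from the utils package, which returns a matrix of edit distances. The sketch below pairs each name with its nearest neighbour in the other vector. It is only an illustration on the three names from the question: nearest-neighbour matching always returns some match, even a wrong one, so the result needs to be checked by eye on real data.

```r
names1 <- c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife")
names2 <- c("Coruna, A", "Alicante", "4 Santa Cruz")

# Edit-distance matrix: rows correspond to names1, columns to names2.
d <- adist(names1, names2)

# For each row, the column index with the smallest distance.
nearest <- apply(d, 1, which.min)

# Candidate pairs together with their distance, for manual review.
matches <- data.frame(
  name1 = names1,
  name2 = names2[nearest],
  dist  = d[cbind(seq_along(names1), nearest)],
  stringsAsFactors = FALSE
)
```

The name2 column can then be used as a key for merge() against the second data frame.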
The fuzzyjoin package was developed precisely for this situation, so why not take advantage of it? However, the values that you want to match on don't seem very similar, even though you said "sometimes the values are a bit different", so the fuzzyjoin solution may not be able to help you here. You can see this from the following:
library(fuzzyjoin)
df1 <- data.frame(province1=c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife"))
df2 <- data.frame(province1=c("Coruna, A", "Alicante", "4 Santa Cruz"))
data.frame(df1, df2)
               province1  province1.1
1               A Coruna    Coruna, A
2       Alicante/Alacant     Alicante
3 Santa Cruz de Tenerife 4 Santa Cruz
The following attempt to merge returns no matches:
merge(df1, df2, by = "province1")
# <0 rows> (or 0-length row.names)
Now try fuzzy matching. The default maximum distance (max_dist) used for joining is 2.
stringdist_inner_join(df1, df2, by = "province1")
# A tibble: 0 x 2
# ... with 2 variables: province1.x <chr>, province1.y <fct>
This returns no records, so try increasing the distance threshold. For this small example, the first record needs a max_dist of 5 to be deemed a match.
stringdist_inner_join(df1, df2, by = "province1", max_dist = 5)
# A tibble: 1 x 2
  province1.x province1.y
  <chr>       <fct>
1 A Coruna    Coruna, A
You have to increase the threshold further to get more matches. But doing so produces a false match, because "A Coruna" also matches "Alicante":
stringdist_inner_join(df1, df2, by = "province1", max_dist = 7)
# A tibble: 2 x 2
  province1.x province1.y
  <chr>       <fct>
1 A Coruna    Coruna, A
2 A Coruna    Alicante
Increasing the threshold to 8 picks up the "Alicante/Alacant" and "Alicante" pair, but "Alicante" is still matched with "A Coruna" as well.
stringdist_inner_join(df1, df2, by = "province1", max_dist = 8, distance_col = "dis")
# A tibble: 3 x 3
  province1.x      province1.y   dis
  <chr>            <fct>       <dbl>
1 A Coruna         Coruna, A       5
2 A Coruna         Alicante        7
3 Alicante/Alacant Alicante        8
So you can see, this isn't going to work for values that are not very similar. You may need to do some data cleaning before using this method. The function also supports several distance methods (the method argument) that you can try. Alternatively, use an iterative approach with increasing thresholds, removing records once they have been matched so that they are not matched again.
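As one possible form of that data cleaning, here is a rough base R sketch (no packages) that compares names by their word tokens instead of by edit distance. The helpers tokens() and same_place() are hypothetical names made up for this example, and the rule "every word of the shorter name appears in the longer one" is an assumption that happens to fit the three sample pairs; it may give ambiguous or wrong matches on a full list of provinces, and it assumes unaccented ASCII names.

```r
# Hypothetical helper: lower-case a name and split it into unique word tokens.
# Anything that is not a letter a-z (digits, commas, slashes, spaces) is a separator.
tokens <- function(x) {
  parts <- strsplit(tolower(x), "[^a-z]+")[[1]]
  unique(parts[nchar(parts) > 0])
}

# Hypothetical rule: two names refer to the same place when every token of
# the shorter name also occurs in the longer name.
same_place <- function(a, b) {
  ta <- tokens(a)
  tb <- tokens(b)
  if (length(ta) > length(tb)) {
    tmp <- ta; ta <- tb; tb <- tmp
  }
  length(ta) > 0 && all(ta %in% tb)
}

names1 <- c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife")
names2 <- c("Coruna, A", "Alicante", "4 Santa Cruz")

# Logical matrix: rows are names1, columns are names2.
hits <- sapply(names2, function(b) sapply(names1, function(a) same_place(a, b)))

# Keep a match only when it is unambiguous (exactly one hit in the row).
idx <- apply(hits, 1, function(r) if (sum(r) == 1) which(r) else NA_integer_)
pairs <- data.frame(name1 = names1, name2 = names2[idx], stringsAsFactors = FALSE)
```

Rows where name2 is NA are the ones that need manual attention; the rest can serve as a key table for merge().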