
Is there a way to merge two dataframes with slightly different string values in columns in R?

I am working on an R task that involves two separate data frames. I need to merge them by one column (containing geographical names), but the values in that column sometimes differ slightly between the two data frames, for example:

"A Coruna" and "Coruna, A", "Alicante/Alacant" and "Alicante", "Santa Cruz de Tenerife" and "4 Santa Cruz".

Pairs like those should be treated as the same value when merging the data frames, so the result of the merge would be a data frame like:

province | males.2018 | males.2013 | area

Is there some way to do it, without using extra libraries?

Thank you

[Image: first data frame] [Image: second data frame]

I think the easiest way is to fix the province names in both data frames: add a new column with the ISO 3166-2:ES codes and merge on that instead. If you paste the output of dput for your data into the question, I can provide code to do that.
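A minimal base R sketch of that idea, assuming the lookup table is written by hand. The three province names come from the question; the numeric values, the name iso_lookup and the column name iso are placeholders made up for illustration.

```r
# Toy data: province names are from the question, the numbers are placeholders.
df1 <- data.frame(
  province   = c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife"),
  males.2018 = c(1, 2, 3),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  province = c("Coruna, A", "Alicante", "4 Santa Cruz"),
  area     = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Hand-built lookup (hypothetical name): every spelling seen in either
# data frame is mapped to its ISO 3166-2:ES code.
iso_lookup <- c(
  "A Coruna"               = "ES-C",
  "Coruna, A"              = "ES-C",
  "Alicante/Alacant"       = "ES-A",
  "Alicante"               = "ES-A",
  "Santa Cruz de Tenerife" = "ES-TF",
  "4 Santa Cruz"           = "ES-TF"
)

# Add the code as a new column; a spelling missing from the lookup becomes NA.
df1$iso <- unname(iso_lookup[df1$province])
df2$iso <- unname(iso_lookup[df2$province])

# Merge on the code, keeping the province spelling from df1 only.
merged <- merge(df1, df2[, c("iso", "area")], by = "iso")
merged
```

Checking anyNA(df1$iso) and anyNA(df2$iso) before merging shows which spellings still need an entry in the lookup.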

I'm not sure how to do this without any external packages. Perhaps agrep, which does approximate string matching in base R, could help?
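A rough sketch of what the agrep route could look like, assuming the province names from the question. The helper best_match and the column name match are hypothetical. Note that agrep looks for an approximate match of the pattern anywhere inside each candidate, so it handles "Alicante" inside "Alicante/Alacant" well but is less suited to reordered names such as "Coruna, A".

```r
df1 <- data.frame(
  province = c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  province = c("Coruna, A", "Alicante", "4 Santa Cruz"),
  stringsAsFactors = FALSE
)

# "Alicante" occurs verbatim inside "Alicante/Alacant", so agrep finds it.
agrep("Alicante", df1$province)
# [1] 2

# Hypothetical helper: return the candidate that approximately contains
# `name`, but only when exactly one candidate qualifies; otherwise NA,
# so that ambiguous or unmatched names can be reviewed by hand.
best_match <- function(name, candidates, max.distance = 0.1) {
  hits <- agrep(name, candidates, max.distance = max.distance, value = TRUE)
  if (length(hits) == 1) hits else NA_character_
}

df2$match <- vapply(df2$province, best_match, character(1),
                    candidates = df1$province, USE.NAMES = FALSE)
df2
```

Rows where match is not NA can then be merged with merge(df1, df2, by.x = "province", by.y = "match"); the remaining rows need manual fixing or a larger max.distance.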

The fuzzyjoin package was developed precisely for this situation, so why not take advantage of it? However, the values you want to match on are not very similar, even though you described them as "sometimes a bit different", so fuzzyjoin may not be able to help you here. You can see this from the following:

library(fuzzyjoin)

df1 <- data.frame(province1=c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife"))
df2 <- data.frame(province1=c("Coruna, A", "Alicante", "4 Santa Cruz"))

data.frame(df1, df2)
               province1  province1.1
1               A Coruna    Coruna, A
2       Alicante/Alacant     Alicante
3 Santa Cruz de Tenerife 4 Santa Cruz

The following attempt to merge returns no matches:

merge(df1, df2, by = "province1")
# <0 rows> (or 0-length row.names)

Now try fuzzy matching. The default maximum distance used for joining (max_dist) is 2.

stringdist_inner_join(df1, df2, by = "province1")
# A tibble: 0 x 2
# ... with 2 variables: province1.x <chr>, province1.y <fct>

This returns no records, so try increasing the distance threshold. For this small example, the first record needs a max_dist of 5 to be deemed a match.

stringdist_inner_join(df1, df2, by = "province1", max_dist = 5)
# A tibble: 1 x 2
  province1.x province1.y
  <chr>       <fct>      
1 A Coruna    Coruna, A

You have to increase the threshold further to get more matches. But this goes wrong, because at a distance of 7 "A Coruna" also matches "Alicante"!

stringdist_inner_join(df1, df2, by = "province1", max_dist = 7)
# A tibble: 2 x 2
  province1.x province1.y
  <chr>       <fct>      
1 A Coruna    Coruna, A  
2 A Coruna    Alicante

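Increasing the threshold to 8 picks up the "Alicante/Alacant" and "Alicante" pair, but "Alicante" is still also matched with "A Coruna". The distance_col argument adds the computed distance to the output.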

stringdist_inner_join(df1, df2, by = "province1", max_dist = 8, distance_col = "dis")
# A tibble: 3 x 3
  province1.x      province1.y   dis
  <chr>            <fct>       <dbl>
1 A Coruna         Coruna, A       5
2 A Coruna         Alicante        7
3 Alicante/Alacant Alicante        8

So you can see that this isn't going to work for values that are not very similar. You may need to do some data cleaning before using this method. The method argument of this function also offers several other distance measures (for example "lv", "jw" or "jaccard") that you can try. Alternatively, use an iterative approach with increasing thresholds, setting aside records once they have been matched so that they are not matched again.
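The cleaning and iterative ideas can be sketched in base R, using adist from the utils package for the edit distance instead of fuzzyjoin. The helper normalize_name, its three cleaning rules and the cut-off max_threshold are assumptions tailored to the three example names, not a general solution.

```r
df1 <- data.frame(
  province = c("A Coruna", "Alicante/Alacant", "Santa Cruz de Tenerife"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  province = c("Coruna, A", "Alicante", "4 Santa Cruz"),
  stringsAsFactors = FALSE
)

# Hypothetical cleaning helper; the rules are guesses based on the examples.
normalize_name <- function(x) {
  x <- sub("^[0-9]+[[:space:]]+", "", x)  # "4 Santa Cruz" -> "Santa Cruz"
  x <- sub("/.*$", "", x)                 # "Alicante/Alacant" -> "Alicante"
  x <- sub("^(.*),[[:space:]]*([^[:space:]]+)$", "\\2 \\1", x)  # "Coruna, A" -> "A Coruna"
  tolower(trimws(x))
}

keys1 <- normalize_name(df1$province)
keys2 <- normalize_name(df2$province)

# Edit distances between every cleaned name in df1 (rows) and df2 (columns).
d <- adist(keys1, keys2)

# Iterative matching: raise the threshold step by step, and never reuse a
# row of df2 that has already been matched.
max_threshold <- 15   # arbitrary cut-off, chosen for this toy example
match_idx <- rep(NA_integer_, length(keys1))
for (threshold in 0:max_threshold) {
  for (i in which(is.na(match_idx))) {
    free <- setdiff(seq_along(keys2), match_idx)
    if (length(free) == 0) break
    j <- free[which.min(d[i, free])]      # closest still-unmatched candidate
    if (d[i, j] <= threshold) match_idx[i] <- j
  }
}

matched <- data.frame(
  df1_name = df1$province,
  df2_name = df2$province[match_idx],
  distance = d[cbind(seq_along(keys1), match_idx)],
  stringsAsFactors = FALSE
)
matched
```

After cleaning, the first two pairs match exactly and are taken out of the pool at threshold 0, so "A Coruna" can no longer be paired with "Alicante". Only "Santa Cruz de Tenerife" needs a large threshold, and pairs accepted at a high distance like that are worth checking by hand.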

