How to fuzzy match by words (not letters) in R?

I need to merge two datasets based on columns that contain names that don't exactly match, sometimes because one of the columns is missing a name that the other has. For example, in one column I have "Martín Gallardo" and in the other I have "Martín Ricardo Gallardo". Another problem is that in some rows the first and last names appear reversed, like "Martín Gallardo" in one and "Gallardo Martín" in the other. How can I match these using R? My first thought was to use str_split on both and assign each name in one set to the name in the other set that shares the most elements with it, but I'm not sure how to code this.

Thank you.

Edit: the data looks something like this:

library(tibble)

A <- tibble(email=c("martingallardo23@gmail.com","raulgimenez@gmail.com"), 
name=c("martin", "raul"), last_name=c("gallardo","gimenez"), 
full_name=c("martin gallardo", "raul gimenez"))
A
#  A tibble: 2 x 4
#   email                      name   last_name full_name
#   <chr>                      <chr>  <chr>     <chr>          
# 1 martingallardo23@gmail.com martin gallardo  martin gallardo
# 2 raulgimenez@gmail.com      raul   gimenez   raul gimenez   

B <- tibble(email=c("martingallardo@gmail.com", "raulgimenez2@gmail.com"), 
name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"), 
full_name=c("martin ricardo gallardo", "gimenez raul"), other_data=c("A", "B"))
B
# A tibble: 2 x 5
#   email                    name           last_name full_name              other_data
#   <chr>                    <chr>          <chr>     <chr>                   <chr>     
# 1 martingallardo@gmail.com martin ricardo gallardo  martin ricardo gallardo A         
# 2 raulgimenez2@gmail.com   gimenez        raul      gimenez raul            B   

This is a tidyverse way to do the join. For each full name in A, it finds the full name in B that shares the highest number of words with it.

library(tidyverse)

A1 <- tibble(
  nombre_completo = c("martin gallardo", "raul gimenez")
  ) %>%
  mutate(
    id_A = row_number()
  )

B1 <- tibble(
  nombre_completo=c("martin ricardo gallardo", "gimenez raul"),
  other_data=c("A", "B")
  ) %>%
  mutate(
    id_B = row_number()
  )


A2 <- A1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words))

B2 <- B1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words)) %>%
  select(name_words, id_B )


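# Join on individual words, count the words each (A, B) pair shares,
# then keep, for every row of A, the row of B with the most shared words.
left_join(A2, B2, by = "name_words") %>%
  count(nombre_completo, id_A, id_B) %>%
  group_by(nombre_completo, id_A) %>%
  slice_max(order_by = n) %>%   # ties keep every tied row of B
  ungroup() %>%
  select(nombre_completo_A = nombre_completo, id_A, id_B) %>%
  left_join(B1, by = "id_B")    # bring back the remaining columns of B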
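
The same word-overlap idea can also be sketched in base R, which may help when checking the scores behind each match. This is only an illustration under stated assumptions: `shared_words`, `names_A`, `names_B`, `overlap`, and `best_B` are hypothetical names that do not appear in the answer above, words are compared after lowercasing only (accents are not normalized), and a name in A with no shared word is left unmatched.

```r
# Hypothetical helper (not part of the answer above):
# number of distinct words two names have in common, ignoring word order.
shared_words <- function(x, y) {
  words_x <- unique(strsplit(trimws(tolower(x)), "\\s+")[[1]])
  words_y <- unique(strsplit(trimws(tolower(y)), "\\s+")[[1]])
  length(intersect(words_x, words_y))
}

names_A <- c("martin gallardo", "raul gimenez")
names_B <- c("martin ricardo gallardo", "gimenez raul")

# Score every pair: rows are names from A, columns are names from B.
overlap <- outer(names_A, names_B, Vectorize(shared_words))
dimnames(overlap) <- list(names_A, names_B)

# For each name in A, the column of B with the highest score (first one on ties).
best_B <- max.col(overlap, ties.method = "first")
shared <- overlap[cbind(seq_along(names_A), best_B)]

# A score of zero means no word in common, so leave that name unmatched.
matched <- ifelse(shared > 0, names_B[best_B], NA_character_)

result <- data.frame(name_A = names_A, name_B = matched, shared = shared)
print(result)
```

Keeping the full `overlap` matrix makes it possible to inspect ties or weak matches before joining, at the cost of comparing every pair of names, which may be slow for large datasets.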

To match these two data sets, I first created a column nombre_completo2 in a restructured form of data set A, based on which nombre_completo in data set B contains the first four characters of nombre_completo in data set A. Then I merged the two data sets so that the additional columns in data set B are added to the restructured form of A. This is how I interpreted your desired output, so I hope it will be useful to you:

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)  # str_detect(), str_sub(), str_replace()

A <- tibble(email=c("martingallardo23@gmail.com","raulgimenez@gmail.com"), 
            name=c("martin", "raul"), last_name=c("gallardo","gimenez"), 
            nombre_completo=c("martin gallardo", "raul gimenez"))


B <- tibble(email=c("martingallardo@gmail.com", "raulgimenez2@gmail.com"), 
            name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"), 
            nombre_completo=c("martin ricardo gallardo", "gimenez raul"), 
            other_data=c("A", "B"))

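A %>%
  rowwise() %>%
  # For each name in A, pick the name in B that contains its first four characters.
  # map_chr() errors unless exactly one name in B matches.
  mutate(nombre_completo2 = map_chr(nombre_completo, 
                                ~ B$nombre_completo
                                [str_detect(B$nombre_completo, str_sub(.x, 1L, 4L))])) %>%
  inner_join(B, by = c("nombre_completo2" = "nombre_completo")) %>%
  # Drop the duplicated columns coming from B and strip the ".x" suffix from A's.
  select(!ends_with(".y")) %>%
  rename_with(~ str_replace(., "\\.x$", ""), ends_with(".x"))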


# A tibble: 2 x 6
# Rowwise: 
#   email                      name   last_name nombre_completo nombre_completo2       other_data
#   <chr>                      <chr>  <chr>     <chr>           <chr>                  <chr>     
# 1 martingallardo23@gmail.com martin gallardo  martin gallardo martin ricardo gallar~ A         
# 2 raulgimenez@gmail.com      raul   gimenez   raul gimenez    gimenez raul           B
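
Because the approach above relies on the first four characters of each name in A appearing in exactly one name in B, it may be worth checking that assumption before running the join. The following base R sketch counts the candidates per name; `names_A`, `names_B`, `prefixes`, and `n_candidates` are hypothetical names used only for this illustration.

```r
names_A <- c("martin gallardo", "raul gimenez")
names_B <- c("martin ricardo gallardo", "gimenez raul")

# First four characters of each name in A, as used by the matching step.
prefixes <- substr(names_A, 1, 4)

# How many names in B contain each prefix (literal match, not a regex).
n_candidates <- vapply(
  prefixes,
  function(p) sum(grepl(p, names_B, fixed = TRUE)),
  integer(1)
)

# Names in A with zero or several candidates in B would need manual review.
ambiguous <- names_A[n_candidates != 1]
print(n_candidates)
print(ambiguous)
```

If `ambiguous` is not empty, a word-based comparison is likely the safer option for those rows.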
