How to fuzzy match by words (not letters) in R?

Question

I need to merge two datasets on columns that contain names that don't exactly match, sometimes because one column is missing a name that the other has. For example, one column has "Martín Gallardo" and the other has "Martín Ricardo Gallardo". Another problem is that in some rows the first and last names appear reversed, such as "Martín Gallardo" in one dataset and "Gallardo Martín" in the other. How can I match these using R? My first thought was to use str_split on both and assign each name in one set to the name in the other set that shares the most elements with it, but I'm not sure how to code this.

Thank you.

Edit: the data looks something like this:

A <- tibble(email=c("martingallardo23@gmail.com","raulgimenez@gmail.com"), 
name=c("martin", "raul"), last_name=c("gallardo","gimenez"), 
full_name=c("martin gallardo", "raul gimenez"))
A
#  A tibble: 2 x 4
#   email                      name   last_name full_name
#   <chr>                      <chr>  <chr>     <chr>          
# 1 martingallardo23@gmail.com martin gallardo  martin gallardo
# 2 raulgimenez@gmail.com      raul   gimenez   raul gimenez   

B <- tibble(email=c("martingallardo@gmail.com", "raulgimenez2@gmail.com"), 
name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"), 
full_name=c("martin ricardo gallardo", "gimenez raul"), other_data=c("A", "B"))
B
# A tibble: 2 x 5
#   email                    name           last_name full_name              other_data
#   <chr>                    <chr>          <chr>     <chr>                   <chr>     
# 1 martingallardo@gmail.com martin ricardo gallardo  martin ricardo gallardo A         
# 2 raulgimenez2@gmail.com   gimenez        raul      gimenez raul            B

Answer 1

This is a tidyverse way to do the join. For each full name (nombre_completo) in A, it finds the full name in B that shares the most words with it.

library(tidyverse)

A1 <- tibble(
  nombre_completo = c("martin gallardo", "raul gimenez")
  ) %>%
  mutate(
    id_A = row_number()
  )

B1 <- tibble(
  nombre_completo=c("martin ricardo gallardo", "gimenez raul"),
  other_data=c("A", "B")
  ) %>%
  mutate(
    id_B = row_number()
  )


A2 <- A1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words))

B2 <- B1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words)) %>%
  select(name_words, id_B )


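# Match on individual words, then count the shared words per (A, B) pair
left_join(A2, B2, by = "name_words") %>%
  group_by(nombre_completo, id_A, id_B) %>%
  count() %>% ungroup() %>%
  # For each name in A, keep the B row(s) sharing the most words (ties are kept)
  group_by(nombre_completo, id_A) %>%
  slice_max(order_by = n) %>%
  select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
  # Bring back the original columns of B
  left_join(B1, by = "id_B")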
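
The same idea, counting the words each pair of names shares and keeping the best-scoring candidate, can also be sketched in base R without a join. This is only an illustration under stated assumptions: match_by_words is a hypothetical helper name, names are lowercased and split on whitespace, and ties go to the first best candidate (the pipeline above keeps all tied rows instead).

```r
# Hypothetical helper: for each name in `a`, return the index of the name in `b`
# that shares the most whole words with it (NA when no word is shared).
match_by_words <- function(a, b) {
  split_words <- function(x) strsplit(tolower(trimws(x)), "\\s+")
  words_a <- split_words(a)
  words_b <- split_words(b)
  vapply(words_a, function(wa) {
    # Number of distinct words shared with every candidate in b
    shared <- vapply(words_b, function(wb) length(intersect(wa, wb)), integer(1))
    if (length(shared) == 0 || max(shared) == 0) return(NA_integer_)
    which.max(shared)  # first candidate with the highest count
  }, integer(1))
}

a <- c("martin gallardo", "raul gimenez")
b <- c("martin ricardo gallardo", "gimenez raul")
idx <- match_by_words(a, b)
data.frame(name_A = a, name_B = b[idx])
```

Because the comparison works on sets of words, word order does not matter and an extra middle name only adds an unmatched word rather than breaking the match.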

Answer 2

To match the two data sets, I first added a column nombre_completo2 to A, holding the nombre_completo value from B that partially matches each nombre_completo in A (the match is on the first four characters of the name in A). I then joined the two data sets so that the additional columns in B are added to this restructured version of A. This is how I interpreted your desired output, so I hope it will be useful to you:

A <- tibble(email=c("martingallardo23@gmail.com","raulgimenez@gmail.com"), 
            name=c("martin", "raul"), last_name=c("gallardo","gimenez"), 
            nombre_completo=c("martin gallardo", "raul gimenez"))


B <- tibble(email=c("martingallardo@gmail.com", "raulgimenez2@gmail.com"), 
            name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"), 
            nombre_completo=c("martin ricardo gallardo", "gimenez raul"), 
            other_data=c("A", "B"))

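library(dplyr)
library(tidyr)
library(purrr)
library(stringr)  # str_detect(), str_sub() and str_replace() come from stringr

A %>%
  rowwise() %>%
  # Look up the name in B that contains the first four characters of the name in A
  mutate(nombre_completo2 = map_chr(nombre_completo, 
                                ~ B$nombre_completo
                                [str_detect(B$nombre_completo, str_sub(.x, 1L, 4L))])) %>%
  inner_join(B, by = c("nombre_completo2" = "nombre_completo")) %>%
  # Keep A's version of the shared columns and strip the ".x" suffix
  select(!ends_with(".y")) %>%
  rename_with(~ str_replace(., "\\.x$", ""), ends_with(".x"))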


# A tibble: 2 x 6
# Rowwise: 
#   email                      name   last_name nombre_completo nombre_completo2       other_data
#   <chr>                      <chr>  <chr>     <chr>           <chr>                  <chr>     
# 1 martingallardo23@gmail.com martin gallardo  martin gallardo martin ricardo gallar~ A         
# 2 raulgimenez@gmail.com      raul   gimenez   raul gimenez    gimenez raul           B
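
One caveat with the map_chr() step above: it needs exactly one name in B to contain the four-character prefix, and it errors when zero or several names match. A minimal sketch of a more forgiving lookup in base R follows; match_by_prefix is a hypothetical helper, and returning NA for ambiguous or missing matches is an assumption about the desired behaviour, not part of the original answer.

```r
# Hypothetical helper: look up each name in `a` by its first `n` characters
# and return the single name in `b` that contains that prefix, else NA.
match_by_prefix <- function(a, b, n = 4L) {
  vapply(a, function(x) {
    hits <- b[grepl(substr(x, 1L, n), b, fixed = TRUE)]
    if (length(hits) == 1L) hits else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

b <- c("martin ricardo gallardo", "gimenez raul")
res <- match_by_prefix(c("martin gallardo", "raul gimenez", "ana perez"), b)
res
```

Rows that come back as NA can then be reviewed by hand or passed to a word-based match.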


Answer 1
1 accepted 2021-04-01 15:49:31

Answer 2
1 2021-04-01 16:33:36
