Fuzzy matching between columns in R

Question

How can I measure how similar two names are in R? In other words, how can I quantify the degree to which a fuzzy match can be made?

For example, I am working with a data frame that looks like this:

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")

df1 <- data.frame(Name.1, Name.2)

df1
            Name.1             Name.2
1         gonzalez gonzalezsoldevilla
2 wassermanschultz            schultz
3   athanasopoulos    anthanasopoulos
4           armato             strain

It is clear from the data that rows 1 and 2 are similar enough to be confident that the name is the same. Row 3 is the same name even though it is misspelled, and row 4 is completely different.

As output, I would like to create a third column that either describes the degree of similarity between the names or returns a boolean indicating whether a fuzzy match can be made.

Answer 1

The stringdist package has a function, stringsim, which returns a number between 0 and 1 measuring the similarity between strings.

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
library(stringdist)

df1 <- data.frame(Name.1, Name.2)
df1$similar <- stringsim(Name.1, Name.2)
df1
#>             Name.1             Name.2   similar
#> 1         gonzalez gonzalezsoldevilla 0.4444444
#> 2 wassermanschultz            schultz 0.4375000
#> 3   athanasopoulos    anthanasopoulos 0.9333333
#> 4           armato             strain 0.1666667
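
The scores above are consistent with 1 minus the edit distance divided by the length of the longer string. As a rough sketch of that idea, and of the boolean column the question asks for, the same numbers can be reproduced in base R with adist. The 0.8 cutoff and the column name match are assumptions for illustration, not part of stringdist or the original answer.

```r
Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")

# Row-wise edit distance using adist() from base R (utils)
edit_dist <- mapply(function(a, b) adist(a, b)[1, 1], Name.1, Name.2,
                    USE.NAMES = FALSE)

# Normalise by the longer string's length so scores fall between 0 and 1
similar <- 1 - edit_dist / pmax(nchar(Name.1), nchar(Name.2))

# Hypothetical cutoff; tune it on your own data
threshold <- 0.8
df1 <- data.frame(Name.1, Name.2, similar, match = similar >= threshold)
df1
```

With this cutoff only row 3 is flagged. Rows 1 and 2 score low because edit distance penalises the large difference in length, so if substring-style matches should count, it may be worth trying a different measure, for example stringsim(Name.1, Name.2, method = "jw"), and choosing the threshold accordingly.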
