How to check if a string column of a data frame matches a string column of another data frame?
I intend to perform a left join of two data frames on a common character column (call it the name column). Ideally, every value in the name column of df1 would have a match in the name column of df2. However, some values may not match exactly and only match partially because of a spelling or punctuation error. For example, "John Ezekiel" could be spelled "John Ezekial" in df1.
I want to ensure that each value of the name column in df1 matches some value in the name column of df2. In most cases there will be an exact match; where there is not, I want to replace the df1 value with the df2 value that is the closest partial match. The following reproducible example illustrates the problem:
df1 <- data.frame(name=c("John Ezekial","Mary Elizabeth","Fabio Fonini","Gael Monfils","Lucas Pouile"))
df2 <- data.frame(name=c("Aron Ramsey","John Doe","John Ezekiel","Mary Elizabeth","Fabio Fognini","Gael Monfils","Marin Cilic","Lucas Pouille","Tomas Berdych","Juan Martin Del Potro"),id=1:10)
> df1
name
1 John Ezekial
2 Mary Elizabeth
3 Fabio Fonini
4 Gael Monfils
5 Lucas Pouile
> df2
name id
1 Aron Ramsey 1
2 John Doe 2
3 John Ezekiel 3
4 Mary Elizabeth 4
5 Fabio Fognini 5
6 Gael Monfils 6
7 Marin Cilic 7
8 Lucas Pouille 8
9 Tomas Berdych 9
10 Juan Martin Del Potro 10
When I left join df1 with df2, I get the following result:
> df1 %>% left_join(df2)
Joining, by = "name"
name id
1 John Ezekial NA
2 Mary Elizabeth 4
3 Fabio Fonini NA
4 Gael Monfils 6
5 Lucas Pouile NA
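The NA ids appear because the join only pairs strings that are exactly equal. As a quick check of which df1 names have no exact counterpart in df2, something like the following base R sketch could be used (the variable name `unmatched` is only illustrative, and `stringsAsFactors = FALSE` is added so the columns are character on older R versions):

```r
# Same example data as above, with character (not factor) columns
df1 <- data.frame(name = c("John Ezekial", "Mary Elizabeth", "Fabio Fonini",
                           "Gael Monfils", "Lucas Pouile"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Aron Ramsey", "John Doe", "John Ezekiel",
                           "Mary Elizabeth", "Fabio Fognini", "Gael Monfils",
                           "Marin Cilic", "Lucas Pouille", "Tomas Berdych",
                           "Juan Martin Del Potro"),
                  id = 1:10,
                  stringsAsFactors = FALSE)

# Names in df1 that do not occur verbatim in df2
unmatched <- df1$name[!df1$name %in% df2$name]
print(unmatched)
# [1] "John Ezekial" "Fabio Fonini" "Lucas Pouile"
```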
I want the resulting data frame to look as shown below. Where there is no exact match, the "name" values in df1 should be replaced with the closest "name" values in df2 and mapped to their corresponding ids.
> df3
name id
1 John Ezekiel 3
2 Mary Elizabeth 4
3 Fabio Fognini 5
4 Gael Monfils 6
5 Lucas Pouille 8
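Using base R, you could do something like the following, which keeps every df2 row whose name is within an edit distance of 1 of some df1 name:
df2[which(adist(df1$name, df2$name) < 2, arr.ind = TRUE)[, 2], ]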
name id
3 John Ezekiel 3
4 Mary Elizabeth 4
5 Fabio Fognini 5
6 Gael Monfils 6
8 Lucas Pouille 8
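Note that `which(..., arr.ind = TRUE)` returns matches in df2 order, and a df1 name with several close candidates would produce several rows. A possible variant, sketched here under the assumption that each df1 name should map to its single nearest df2 name, picks the minimum-distance column per row and keeps df1's order. The names `d`, `best`, `best_dist` and `max_dist` are illustrative, and the cutoff of 1 simply mirrors the `< 2` condition above:

```r
df1 <- data.frame(name = c("John Ezekial", "Mary Elizabeth", "Fabio Fonini",
                           "Gael Monfils", "Lucas Pouile"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Aron Ramsey", "John Doe", "John Ezekiel",
                           "Mary Elizabeth", "Fabio Fognini", "Gael Monfils",
                           "Marin Cilic", "Lucas Pouille", "Tomas Berdych",
                           "Juan Martin Del Potro"),
                  id = 1:10,
                  stringsAsFactors = FALSE)

# Edit-distance matrix: one row per df1 name, one column per df2 name
d <- adist(df1$name, df2$name)

# For each df1 name, the position of the closest df2 name (first one on ties)
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(d)), best)]

# Assumed cutoff: treat anything farther than 1 edit as "no match"
max_dist <- 1
best[best_dist > max_dist] <- NA

# Rows of df2 in df1 order; unmatched names would become rows of NA
df3 <- df2[best, ]
rownames(df3) <- NULL
print(df3)
```

With the example data, every df1 name has a df2 name within one edit, so df3 contains ids 3, 4, 5, 6 and 8 in df1's order. For longer or noisier names the cutoff would likely need tuning.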