How to check if a string column of a data frame matches a string column of another data frame?

I intend to left-join two data frames on a common character column (call it the name column). Ideally, every value in the name column of df1 would have a match in the name column of df2. However, some values may not match exactly and only match partially because of a spelling or punctuation error. For example, "John Ezekiel" could be spelled "John Ezekial" in df1. I want to ensure that each value of the name column in df1 matches some value in the name column of df2. In most cases there will be an exact match; where there is not, I want to replace the df1 value with the df2 value that matches it most closely. I've illustrated the problem with a reproducible example:

df1 <- data.frame(name=c("John Ezekial","Mary Elizabeth","Fabio Fonini","Gael Monfils","Lucas Pouile"))    
df2 <- data.frame(name=c("Aron Ramsey","John Doe","John Ezekiel","Mary Elizabeth","Fabio Fognini","Gael Monfils","Marin Cilic","Lucas Pouille","Tomas Berdych","Juan Martin Del Potro"),id=1:10)
> df1
            name
1   John Ezekial
2 Mary Elizabeth
3   Fabio Fonini
4   Gael Monfils
5   Lucas Pouile

> df2
                    name id
1            Aron Ramsey  1
2               John Doe  2
3           John Ezekiel  3
4         Mary Elizabeth  4
5          Fabio Fognini  5
6           Gael Monfils  6
7            Marin Cilic  7
8          Lucas Pouille  8
9          Tomas Berdych  9
10 Juan Martin Del Potro 10

When I left-join df1 with df2 (using dplyr), I get the following:

> df1 %>% left_join(df2)
Joining, by = "name"
            name id
1   John Ezekial NA
2 Mary Elizabeth  4
3   Fabio Fonini NA
4   Gael Monfils  6
5   Lucas Pouile NA
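
Before attempting any fuzzy matching, it can help to list which names fail the exact match, since those are the rows the join leaves as NA. The following is a minimal sketch using the sample data from the question; the variable names `has_exact` and `unmatched` are made up for illustration.

```r
# Sample data copied from the question (stringsAsFactors = FALSE keeps names as character).
df1 <- data.frame(name = c("John Ezekial", "Mary Elizabeth", "Fabio Fonini",
                           "Gael Monfils", "Lucas Pouile"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Aron Ramsey", "John Doe", "John Ezekiel",
                           "Mary Elizabeth", "Fabio Fognini", "Gael Monfils",
                           "Marin Cilic", "Lucas Pouille", "Tomas Berdych",
                           "Juan Martin Del Potro"),
                  id = 1:10,
                  stringsAsFactors = FALSE)

# TRUE where a df1 name appears verbatim in df2 (hypothetical helper variable).
has_exact <- df1$name %in% df2$name

# Names that a plain left join would leave with id = NA.
unmatched <- df1$name[!has_exact]
print(unmatched)
```

With this data, `unmatched` holds the three misspelled names ("John Ezekial", "Fabio Fonini", "Lucas Pouile"), which are the same rows that show NA in the join above.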

I want the resulting data frame to look like the one below. Where a "name" value in df1 has no exact match, it should be replaced with the closest "name" value from df2 and mapped to the corresponding id.

> df3
            name id
1   John Ezekiel  3
2 Mary Elizabeth  4
3  Fabio Fognini  5
4   Gael Monfils  6
5  Lucas Pouille  8

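Using base R, you could do something like:

# adist() gives the edit distance between every df1 name (rows) and every df2 name (columns);
# keep the df2 rows that lie within one edit of some df1 name.
df2[which(adist(df1$name, df2$name) < 2, arr.ind = TRUE)[, 2], ]
            name id
3   John Ezekiel  3
4 Mary Elizabeth  4
5  Fabio Fognini  5
6   Gael Monfils  6
8  Lucas Pouille  8

Note that which(..., arr.ind = TRUE) reports matches in column order, so the rows come back in df2's order rather than df1's; here the two happen to coincide.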
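
A fixed cutoff can return several df2 rows for one df1 name, or none at all. A possible variation, sketched below under stated assumptions, picks the single closest df2 name for each df1 row and keeps df1's row order. The tolerance `max_dist` and the variable names `d`, `nearest` and `best_dist` are assumptions for illustration, not part of the original answer.

```r
# Sample data copied from the question.
df1 <- data.frame(name = c("John Ezekial", "Mary Elizabeth", "Fabio Fonini",
                           "Gael Monfils", "Lucas Pouile"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Aron Ramsey", "John Doe", "John Ezekiel",
                           "Mary Elizabeth", "Fabio Fognini", "Gael Monfils",
                           "Marin Cilic", "Lucas Pouille", "Tomas Berdych",
                           "Juan Martin Del Potro"),
                  id = 1:10,
                  stringsAsFactors = FALSE)

# Edit-distance matrix: one row per df1 name, one column per df2 name.
d <- adist(df1$name, df2$name)

# For each df1 name, the column index of the closest df2 name
# (which.min returns the first one when there is a tie).
nearest <- apply(d, 1, which.min)

# Distance to that closest name, picked out of the matrix row by row.
best_dist <- d[cbind(seq_len(nrow(d)), nearest)]

# Assumed tolerance: anything further away than this is treated as "no match".
max_dist <- 2
nearest[best_dist > max_dist] <- NA

# Take the matched df2 rows in df1's order; unmatched rows become NA.
df3 <- df2[nearest, , drop = FALSE]
rownames(df3) <- NULL
print(df3)
```

On the sample data every misspelled name is one edit away from its intended match, so `df3` contains the df2 spellings with ids 3, 4, 5, 6 and 8, as in the desired output. The right value for `max_dist` depends on how noisy the real names are, so it is worth inspecting `best_dist` before trusting the replacements.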
