Partially matching two data frames on a common column (word by word) in R / Python

Question

I have two data frames stored as CSV files; df1 has more rows than df2:

Df1

Name                         Count
xxx yyyyyy bbb cccc           15
fffdd 444 ggg                 20
kkbbb ccc dd 29p              5
22 cc pbc2 kmn3 b23 efgh      4
ccccccccc sss qqqq            2

Df2

Name
xxx yyyyyy bbb cccc
ccccccccc sss qqqq pppc
22 cc pbc2 kmn3 b23,efgh

I want to do partial (approximate/fuzzy) matching on the Name column, for example by matching on the first two or three words of each name. The output should look like this:

Output:

Name                       Count
xxx yyyyyy bbb cccc         15
22 cc pbc2 kmn3 b23 efgh    4
ccccccccc sss qqqq          2

Exact matching misses some of the rows. I tried agrep in R, but it did not work for me, and fuzzy matching is quite slow. Please suggest a way to do this in R or Python. Any help is appreciated!

Answer 1

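In R, you can use agrep for fuzzy matching. Its max.distance parameter sets the maximum distance allowed for a match; when given as a fraction, it is taken relative to the pattern length. Note that the call below assumes each name in DF2 matches exactly one row of DF1, so that sapply returns a plain vector of row indices.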

DF1[sapply(DF2$Name, agrep, DF1$Name, max.distance = 0.2), ]

#                       Name Count
# 1      xxx yyyyyy bbb cccc    15
# 5       ccccccccc sss qqqq     2
# 4 22 cc pbc2 kmn3 b23 efgh     4
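Since the question also asks about Python, here is a rough equivalent using only the standard library. It is a sketch, not a port of agrep: difflib scores similarity with a ratio rather than an edit distance, and the function name fuzzy_filter and the cutoff of 0.8 are assumptions chosen for this sample data.

```python
import difflib

# Sample data copied from the question: (Name, Count) rows and query names.
df1 = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2 = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def fuzzy_filter(rows, queries, cutoff=0.8):
    """Return the rows whose name is the closest match to some query.

    cutoff is a similarity ratio in [0, 1]; it is a hypothetical default
    and would need tuning on real data.
    """
    names = [name for name, _ in rows]
    counts = dict(rows)  # assumes names in rows are unique
    matched = []
    for query in queries:
        # Keep only the single best candidate at or above the cutoff.
        best = difflib.get_close_matches(query, names, n=1, cutoff=cutoff)
        if best:
            matched.append((best[0], counts[best[0]]))
    return matched


result = fuzzy_filter(df1, df2)
for name, count in result:
    print(name, count)
```

As with the R call, the rows come back in the order of the names in df2. Every query is compared against every name, so this will be slow on large inputs.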

The data:

DF1 <- read.table(text = "Name                         Count
'xxx yyyyyy bbb cccc'           15
'fffdd 444 ggg '                20
'kkbbb ccc dd 29p'              5
'22 cc pbc2 kmn3 b23 efgh'      4
'ccccccccc sss qqqq'           2", header = TRUE, stringsAsFactors = FALSE)

DF2 <- read.table(text = "Name
'xxx yyyyyy bbb cccc'
'ccccccccc sss qqqq pppc'
'22 cc pbc2 kmn3 b23,efgh'", header = TRUE, stringsAsFactors = FALSE)
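The question also mentions matching on the first two or three words, which avoids fuzzy scoring entirely and so may be faster on large inputs. A minimal Python sketch of that idea follows; the helper names prefix_key and match_by_prefix are made up for illustration, and words are assumed to be separated by whitespace only.

```python
# Sample data copied from the question: (Name, Count) rows and query names.
rows = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
queries = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def prefix_key(name, n_words=2):
    """Lowercase the name and keep its first n_words whitespace-separated words."""
    return tuple(name.lower().split()[:n_words])


def match_by_prefix(rows, queries, n_words=2):
    """Keep the rows whose first n_words words equal those of any query."""
    wanted = {prefix_key(q, n_words) for q in queries}
    return [(name, count) for name, count in rows
            if prefix_key(name, n_words) in wanted]


matched = match_by_prefix(rows, queries, n_words=2)
for name, count in matched:
    print(name, count)
```

Building the set of keys once makes each lookup constant time, so the whole pass is linear in the number of rows. The trade-off is that a typo inside the first few words causes a miss, which fuzzy matching would tolerate.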
