Partial matching of two data frames on a common column (by words) in R/Python
I have two data frames stored as CSV files, where df1 has more rows than df2:
Df1
Name Count
xxx yyyyyy bbb cccc 15
fffdd 444 ggg 20
kkbbb ccc dd 29p 5
22 cc pbc2 kmn3 b23 efgh 4
ccccccccc sss qqqq 2
Df2
Name
xxx yyyyyy bbb cccc
ccccccccc sss qqqq pppc
22 cc pbc2 kmn3 b23,efgh
I want to do partial (approximate/fuzzy) matching on the first two or three words of each name. Basically, the output should look like this:
Name Count
xxx yyyyyy bbb cccc 15
22 cc pbc2 kmn3 b23 efgh 4
ccccccccc sss qqqq 2
With exact matching, I miss some of the rows. I tried agrep in R, but somehow it is not working, and fuzzy matching is quite slow. Please suggest a way to do this in R or Python. Any help is appreciated!
In R, you can use agrep for fuzzy matching. The max.distance parameter sets the maximum distance allowed for a match; a value below 1, such as 0.2, is interpreted as a fraction of the pattern length.
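# For each name in DF2, look up the index of the approximately matching row in DF1
# (this assumes every DF2 name matches exactly one DF1 row)
DF1[sapply(DF2$Name, agrep, DF1$Name, max.distance = 0.2), ]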
# Name Count
# 1 xxx yyyyyy bbb cccc 15
# 5 ccccccccc sss qqqq 2
# 4 22 cc pbc2 kmn3 b23 efgh 4
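For the Python side of the question, a rough counterpart to the agrep call above can be sketched with the standard library's difflib. This is only a sketch under assumptions: the rows are held as plain tuples rather than a pandas data frame, the helper names similarity and fuzzy_match are made up for illustration, and the 0.8 threshold is an arbitrary choice that is not equivalent to max.distance = 0.2.

```python
from difflib import SequenceMatcher

# Sample data copied from the question, as (Name, Count) tuples.
df1 = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2 = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(rows, names, threshold=0.8):
    """For each name, keep the most similar row if it clears the threshold.

    The threshold is an illustrative assumption; tune it for real data.
    """
    matched = []
    for name in names:
        best = max(rows, key=lambda row: similarity(name, row[0]))
        if similarity(name, best[0]) >= threshold:
            matched.append(best)
    return matched


result = fuzzy_match(df1, df2)
for name, count in result:
    print(name, count)
```

Like the R one-liner, this compares every df2 name against every df1 name, so the cost grows with the product of the two table sizes.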
The data:
DF1 <- read.table(text = "Name Count
'xxx yyyyyy bbb cccc' 15
'fffdd 444 ggg ' 20
'kkbbb ccc dd 29p' 5
'22 cc pbc2 kmn3 b23 efgh' 4
'ccccccccc sss qqqq' 2", header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text = "Name
'xxx yyyyyy bbb cccc'
'ccccccccc sss qqqq pppc'
'22 cc pbc2 kmn3 b23,efgh'", header = TRUE, stringsAsFactors = FALSE)
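If the goal is literally to match on the first two or three words, as the question describes, a key built from the leading words avoids fuzzy scoring altogether and needs only one pass over each table. A minimal Python sketch, assuming plain tuples instead of data frames and treating both whitespace and commas as word separators; word_key and match_by_leading_words are hypothetical helper names:

```python
import re

# Sample data copied from the question, as (Name, Count) tuples.
df1 = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2 = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def word_key(name, n_words=2):
    """Return the first n_words of a name as a lowercase tuple."""
    words = re.split(r"[\s,]+", name.strip().lower())
    return tuple(words[:n_words])


def match_by_leading_words(rows, names, n_words=2):
    """Keep the rows whose leading words also start some name in names."""
    wanted = {word_key(name, n_words) for name in names}
    return [row for row in rows if word_key(row[0], n_words) in wanted]


result = match_by_leading_words(df1, df2, n_words=2)
for name, count in result:
    print(name, count)
```

On the sample data this keeps the same three rows as the expected output in the question, in df1 order. It does not tolerate typos inside the leading words, which is the case where agrep or a similarity score is still needed.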