[英]Merge two dataframes by partial string match
I am trying to merge two fairly large dataframes of different sizes based on partial string matches. 我试图基于部分字符串匹配来合并两个不同大小的相当大的数据帧。
df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code. df1 $ code包含所有12位数字代码,而df2 $ code包含10-12位数字的代码组合,其中一些较短的代码是与df1 $ code中12位数字代码匹配的子字符串。
Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1. 因此,我需要合并两个数据帧之间的所有12位数字匹配项,还要合并df2中具有10-11位数字代码的记录,这些记录是与df1的子字符串匹配。
Example dataframes: 示例数据框:
df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))
df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))
df3 (merged)
code_1 code_2 name color
123456789012 123456789012 bob blue
210987654321 2109876543 joe red
567890543211 7890543211 sally green
987656789001 98765678900 john purple
123456654321 12345665432 lucy orange
678905432156 678905432156 alan brown
Try this SQL join. 尝试此SQL连接。
library(sqldf)
sqldf("select a.code_1, b.code_2, a.name, b.color
from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")
giving: 赠送:
code_1 code_2 name color
1 123456789012 123456789012 bob blue
2 210987654321 2109876543 joe red
3 567890543211 7890543211 sally green
4 987656789001 98765678900 john purple
5 123456654321 12345665432 lucy orange
6 678905432156 678905432156 alan brown
Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1
and code_2
. 更新:更新了答案以反映所讨论的更改,以便(1)子字符串可以在目标字符串中的任何位置,并且(2)代码列的名称已更改为code_1
和code_2
。
Updated per new info. 根据新信息更新。 This should work: 这应该工作:
df2$New <- lapply(df2$code_2, grep, df1$code_1,value=T)
combined <- merge(df1,df2, by.x="code_1", by.y="New")
code_1 name code_2 color
1 123456654321 lucy 12345665432 orange
2 123456789012 bob 123456789012 blue
3 210987654321 joe 2109876543 red
4 567890543211 sally 7890543211 green
5 678905432156 alan 678905432156 brown
6 987656789001 john 98765678900 purple
We can use grep
+ sapply
to extract indices of matches from df2$code
for each df1$code
and create a matchID
out of it. 我们可以使用grep
+ sapply
从df2$code
为每个df1$code
提取匹配索引,并在其中创建一个matchID
。 Next, we merge
on matchID
to get desired output: 接下来,我们在matchID
上merge
以获得所需的输出:
df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))
df_merge = merge(df1, df2, by = "matchID")[-1]
Note that if a df1$code
does not match any df2$code
, df2$matchID
will be blank, and so would not merge with df1$matchID
. 请注意,如果df1$code
与任何df2$code
不匹配,则df2$matchID
将为空,因此不会与df1$matchID
合并。
Results: 结果:
> df2
code color matchID
1 123456789012 blue 1
2 2109876543 red 2
3 7890543211 green 3
4 98765678900 purple 4
5 12345665432 orange 5
6 678905432156 brown 6
7 14124124124 black
> df_merge
code.x name code.y color
1 123456789012 bob 123456789012 blue
2 210987654321 joe 2109876543 red
3 567890543211 sally 7890543211 green
4 987656789001 john 98765678900 purple
5 123456654321 lucy 12345665432 orange
6 678905432156 alan 678905432156 brown
Data (Added non-match for better demo): 数据(添加了不匹配项以获得更好的演示):
df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
stringsAsFactors = FALSE)
df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
stringsAsFactors = FALSE)
In python/pandas, you can do: 在python / pandas中,您可以执行以下操作:
from pandas import DataFrame, Series
df1 = DataFrame(dict(
code1 = ('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = ('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom')))
df2 = DataFrame(dict(
code2 = ('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
color = ('blue', 'red', 'green', 'purple', 'orange', 'brown')))
matches = [df1[df1['code1'].str.contains(x)].index[0] for x in df2['code2']]
print(
df1.assign(subcode=Series(data=df2['code2'], index=matches))
.merge(df2, left_on='subcode', right_on='code2')
.drop('subcode', axis='columns')
)
And that dumps: 然后转储:
code1 name code2 color
0 123456789012 bob 123456789012 blue
1 210987654321 joe 2109876543 red
2 567890543211 sally 7890543211 green
3 987656789001 john 98765678900 purple
4 123456654321 lucy 12345665432 orange
5 678905432156 alan 678905432156 brown
Note: I hate using loops with dataframes, but this, uh, works, I guess. 注意:我讨厌将循环与数据帧一起使用,但是,我猜这是可行的。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.