简体   繁体   English

通过部分字符串匹配合并两个数据帧

[英]Merge two dataframes by partial string match

I am trying to merge two fairly large dataframes of different sizes based on partial string matches. 我试图基于部分字符串匹配来合并两个不同大小的相当大的数据帧。

df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code. df1 $ code包含所有12位数字代码,而df2 $ code包含10-12位数字的代码组合,其中一些较短的代码是与df1 $ code中12位数字代码匹配的子字符串。

Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1. 因此,我需要合并两个数据帧之间的所有12位数字匹配项,还要合并df2中具有10-11位数字代码的记录,这些记录是与df1的子字符串匹配。

Example dataframes: 示例数据框:

df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
              name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))

df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
              color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))

df3 (merged)

code_1         code_2         name  color
123456789012   123456789012   bob   blue
210987654321   2109876543     joe   red
567890543211   7890543211     sally green
987656789001   98765678900    john  purple
123456654321   12345665432    lucy  orange
678905432156   678905432156   alan  brown

Try this SQL join. 尝试此SQL连接。

library(sqldf)

sqldf("select a.code_1, b.code_2, a.name, b.color 
       from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")

giving: 赠送:

        code_1       code_2  name  color
1 123456789012 123456789012   bob   blue
2 210987654321   2109876543   joe    red
3 567890543211   7890543211 sally  green
4 987656789001  98765678900  john purple
5 123456654321  12345665432  lucy orange
6 678905432156 678905432156  alan  brown

Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1 and code_2 . 更新:更新了答案以反映所讨论的更改,以便(1)子字符串可以在目标字符串中的任何位置,并且(2)代码列的名称已更改为code_1code_2

Updated per new info. 根据新信息更新。 This should work: 这应该工作:

df2$New <- lapply(df2$code_2, grep, df1$code_1,value=T)

combined <- merge(df1,df2, by.x="code_1", by.y="New")

        code_1  name       code_2  color
1 123456654321  lucy  12345665432 orange
2 123456789012   bob 123456789012   blue
3 210987654321   joe   2109876543    red
4 567890543211 sally   7890543211  green
5 678905432156  alan 678905432156  brown
6 987656789001  john  98765678900 purple

We can use grep + sapply to extract indices of matches from df2$code for each df1$code and create a matchID out of it. 我们可以使用grep + sapplydf2$code为每个df1$code提取匹配索引,并在其中创建一个matchID Next, we merge on matchID to get desired output: 接下来,我们在matchIDmerge以获得所需的输出:

df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))

df_merge = merge(df1, df2, by = "matchID")[-1]

Note that if a df1$code does not match any df2$code , df2$matchID will be blank, and so would not merge with df1$matchID . 请注意,如果df1$code与任何df2$code不匹配,则df2$matchID将为空,因此不会与df1$matchID合并。

Results: 结果:

> df2
          code  color matchID
1 123456789012   blue       1
2   2109876543    red       2
3   7890543211  green       3
4  98765678900 purple       4
5  12345665432 orange       5
6 678905432156  brown       6
7  14124124124  black        

> df_merge
        code.x  name       code.y  color
1 123456789012   bob 123456789012   blue
2 210987654321   joe   2109876543    red
3 567890543211 sally   7890543211  green
4 987656789001  john  98765678900 purple
5 123456654321  lucy  12345665432 orange
6 678905432156  alan 678905432156  brown

Data (Added non-match for better demo): 数据(添加了不匹配项以获得更好的演示):

df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
                  name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
                  stringsAsFactors = FALSE)

df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
                  color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
                  stringsAsFactors = FALSE)

In python/pandas, you can do: 在python / pandas中,您可以执行以下操作:

from pandas import DataFrame, Series
df1 = DataFrame(dict(
        code1 = ('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
        name = ('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom')))

df2 = DataFrame(dict(
        code2 = ('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
        color = ('blue', 'red', 'green', 'purple', 'orange', 'brown')))

matches = [df1[df1['code1'].str.contains(x)].index[0] for x in df2['code2']]

print(
    df1.assign(subcode=Series(data=df2['code2'], index=matches))
       .merge(df2, left_on='subcode', right_on='code2')
       .drop('subcode', axis='columns')
)

And that dumps: 然后转储:

          code1   name         code2   color
0  123456789012    bob  123456789012    blue
1  210987654321    joe    2109876543     red
2  567890543211  sally    7890543211   green
3  987656789001   john   98765678900  purple
4  123456654321   lucy   12345665432  orange
5  678905432156   alan  678905432156   brown

Note: I hate using loops with dataframes, but this, uh, works, I guess. 注意:我讨厌将循环与数据帧一起使用,但是,我猜这是可行的。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM