Merge two dataframes by partial string match

Question

I am trying to merge two fairly large dataframes of different sizes based on partial string matches.

df1$code_1 contains only 12-digit codes, while df2$code_2 contains a mix of 10- to 12-digit codes, some of which are substrings of the 12-digit codes in df1$code_1.

Therefore, I need to merge all 12-digit matches between the two dataframes, and also the records in df2 whose 10- or 11-digit codes are substrings of a code in df1.

Example dataframes:

df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
              name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))

df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
              color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))

Desired output, df3 (merged):

code_1         code_2         name  color
123456789012   123456789012   bob   blue
210987654321   2109876543     joe   red
567890543211   7890543211     sally green
987656789001   98765678900    john  purple
123456654321   12345665432    lucy  orange
678905432156   678905432156   alan  brown

Answer 1

Try this SQL join.

library(sqldf)

sqldf("select a.code_1, b.code_2, a.name, b.color 
       from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")

giving:

        code_1       code_2  name  color
1 123456789012 123456789012   bob   blue
2 210987654321   2109876543   joe    red
3 567890543211   7890543211 sally  green
4 987656789001  98765678900  john purple
5 123456654321  12345665432  lucy orange
6 678905432156 678905432156  alan  brown

Update: Revised the answer to reflect the changes to the question: (1) the substring can be anywhere in the target string, and (2) the code columns are now named code_1 and code_2.
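For readers who want to try the join condition outside R, here is a minimal sketch using Python's built-in sqlite3 module (sqldf runs its queries on SQLite by default). The tables are a cut-down copy of the example data, and the '14124124124' row is an assumed non-match added only to show that the left join keeps it, with empty df1 columns.

```python
import sqlite3

# In-memory database holding small stand-ins for df1 and df2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df1 (code_1 TEXT, name TEXT)")
conn.execute("CREATE TABLE df2 (code_2 TEXT, color TEXT)")
conn.executemany("INSERT INTO df1 VALUES (?, ?)", [
    ("123456789012", "bob"),
    ("210987654321", "joe"),
    ("567890543211", "sally"),
    ("768927461037", "fred"),
])
conn.executemany("INSERT INTO df2 VALUES (?, ?)", [
    ("123456789012", "blue"),
    ("2109876543", "red"),
    ("7890543211", "green"),
    ("14124124124", "black"),  # assumed non-match, for illustration
])

# Same join condition as the sqldf query: code_2 may appear anywhere in code_1.
rows = conn.execute("""
    SELECT a.code_1, b.code_2, a.name, b.color
    FROM df2 b
    LEFT JOIN df1 a ON a.code_1 LIKE '%' || b.code_2 || '%'
    ORDER BY b.rowid
""").fetchall()

for row in rows:
    print(row)
```

Because df2 is on the left side of the join, every df2 row appears at least once in the result. Switching to an inner join would drop the unmatched codes instead.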

Answer 2

Updated per new info. This should work:

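# For each code in df2, take the first df1 code that contains it (NA if none)
df2$New <- sapply(df2$code_2, function(x) grep(x, df1$code_1, value = TRUE, fixed = TRUE)[1])

# Join on the full 12-digit code found above
combined <- merge(df1, df2, by.x = "code_1", by.y = "New")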

        code_1  name       code_2  color
1 123456654321  lucy  12345665432 orange
2 123456789012   bob 123456789012   blue
3 210987654321   joe   2109876543    red
4 567890543211 sally   7890543211  green
5 678905432156  alan 678905432156  brown
6 987656789001  john  98765678900 purple

Answer 3

We can use grep + sapply to find, for each df2$code, the row index of the matching df1$code, and store it as a matchID. Next, we merge on matchID to get the desired output:

df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))

df_merge = merge(df1, df2, by = "matchID")[-1]

Note that if a df2$code does not match any df1$code, its df2$matchID will be empty, so that row will not merge with any df1$matchID.

Results:

> df2
          code  color matchID
1 123456789012   blue       1
2   2109876543    red       2
3   7890543211  green       3
4  98765678900 purple       4
5  12345665432 orange       5
6 678905432156  brown       6
7  14124124124  black        

> df_merge
        code.x  name       code.y  color
1 123456789012   bob 123456789012   blue
2 210987654321   joe   2109876543    red
3 567890543211 sally   7890543211  green
4 987656789001  john  98765678900 purple
5 123456654321  lucy  12345665432 orange
6 678905432156  alan 678905432156  brown

Data (added a non-match for a better demo):

df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
                  name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
                  stringsAsFactors = FALSE)

df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
                  color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
                  stringsAsFactors = FALSE)

Answer 4

In Python/pandas, you can do:

from pandas import DataFrame, Series
df1 = DataFrame(dict(
        code1 = ('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
        name = ('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom')))

df2 = DataFrame(dict(
        code2 = ('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
        color = ('blue', 'red', 'green', 'purple', 'orange', 'brown')))

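# For each code in df2, find the index of the first df1 row whose code contains it.
# Note: .index[0] raises IndexError if a df2 code matches nothing in df1.
matches = [df1[df1['code1'].str.contains(x, regex=False)].index[0] for x in df2['code2']]

print(
    # .values places the codes at the matched positions instead of re-aligning on df2's own index
    df1.assign(subcode=Series(data=df2['code2'].values, index=matches))
       .merge(df2, left_on='subcode', right_on='code2')
       .drop('subcode', axis='columns')
)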

And that prints:

          code1   name         code2   color
0  123456789012    bob  123456789012    blue
1  210987654321    joe    2109876543     red
2  567890543211  sally    7890543211   green
3  987656789001   john   98765678900  purple
4  123456654321   lucy   12345665432  orange
5  678905432156   alan  678905432156   brown

Note: I hate using loops with dataframes, but this, uh, works, I guess.
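One possible way to avoid the per-code lookup, and to tolerate codes that match nothing, is a cross join followed by a substring filter. This is only a sketch under assumptions: it needs pandas 1.2 or later for how="cross", the data is a cut-down copy of the example, and the '14124124124' row is an assumed non-match added for illustration. The intermediate table has len(df1) * len(df2) rows, so it may not suit very large inputs.

```python
import pandas as pd

df1 = pd.DataFrame({
    "code1": ["123456789012", "210987654321", "567890543211", "768927461037"],
    "name": ["bob", "joe", "sally", "fred"],
})
df2 = pd.DataFrame({
    "code2": ["123456789012", "2109876543", "7890543211", "14124124124"],
    "color": ["blue", "red", "green", "black"],  # last row is an assumed non-match
})

# Pair every df2 row with every df1 row.
pairs = df2.merge(df1, how="cross")

# Keep only the pairs where code2 occurs somewhere inside code1.
is_match = [c2 in c1 for c1, c2 in zip(pairs["code1"], pairs["code2"])]
matched = pairs[is_match]

# Left-merge back onto df2 so unmatched codes survive with NaN in the df1 columns.
result = df2.merge(matched[["code2", "code1", "name"]], on="code2", how="left")
print(result)
```

Using how="inner" in the final merge would instead drop the unmatched df2 rows, which matches the output shown in the answer above.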
