Join Pandas DataFrames matching by string and substring

I want to merge two DataFrames by partial string match. The first, df1, has 130,000 rows like this:

id    text                        xc1       xc2
1     adidas men shoes            52465     220
2     vakko men suits             49220     224
3     burberry men shirt          78248     289
4     prada women shoes           45780     789
5     lcwaikiki men sunglasses    34788     745

The second, df2, has 8,000 rows like this:

id    keyword               abc1     abc2
1     men shoes             1000     11
2     men suits             2000     12
3     men shirt             3000     13
4     women socks           4000     14
5     men sunglasses        5000     15

After matching keyword against text, the output should look like this:

id    text                        xc1       xc2   keyword         abc1  abc2
1     adidas men shoes            52465     220   men shoes       1000  11
2     vakko men suits             49220     224   men suits       2000  12
3     burberry men shirt          78248     289   men shirt       3000  13
4     lcwaikiki men sunglasses    34788     745   men sunglasses  5000  15

Let's start by ordering the keywords longest-first, so that "women suits" is tried before "men suits":

lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)

Now define a matching function; each text value from df1 will be passed as s to find a matching keyword:

def is_match(arr, s):
    for a in arr:
        if a in s:
            return a
    return None

Now we can extract the keyword from each text in df1, and add it to a new column:

df1['keyword'] = df1['text'].apply(lambda x: is_match(lkeys, x))

We now have everything we need for a standard merge:

pd.merge(df1, df2, on='keyword')
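
To check the steps above end to end, here is a minimal sketch that rebuilds the five sample rows from the question and runs the same code. The suffixes argument and the name merged are my own additions for readability, not part of the answer. One thing it appears to show: since 'men shoes' in 'prada women shoes' is True and df2 has no 'women shoes' keyword, the plain in test also attaches men shoes to the prada row, so on this sample the result has 5 rows rather than the 4 asked for.

```python
import pandas as pd

# Sample rows copied from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": ["adidas men shoes", "vakko men suits", "burberry men shirt",
             "prada women shoes", "lcwaikiki men sunglasses"],
    "xc1": [52465, 49220, 78248, 45780, 34788],
    "xc2": [220, 224, 289, 789, 745],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "keyword": ["men shoes", "men suits", "men shirt",
                "women socks", "men sunglasses"],
    "abc1": [1000, 2000, 3000, 4000, 5000],
    "abc2": [11, 12, 13, 14, 15],
})

# Keywords ordered longest-first.
lkeys = df2.keyword.reindex(
    df2.keyword.str.len().sort_values(ascending=False).index
)

def is_match(arr, s):
    """Return the first keyword in arr that is a substring of s, else None."""
    for a in arr:
        if a in s:
            return a
    return None

df1["keyword"] = df1["text"].apply(lambda x: is_match(lkeys, x))

# suffixes is an illustrative choice; both frames have an "id" column.
merged = pd.merge(df1, df2, on="keyword", suffixes=("_df1", "_df2"))

print(df1[["text", "keyword"]])
print(len(merged))
```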

Another approach is to cross join the 2 dataframes and then filter the rows by matching keyword against text, as follows:

import re

df3 = df1.merge(df2, how='cross')    # for Pandas version >= 1.2.0 (released in Dec 2020)

# Keep only the pairs where keyword occurs in text as whole words
mask = df3.apply(lambda x: re.search(rf"\b{x['keyword']}\b", str(x['text'])) is not None, axis=1)
df_out = df3.loc[mask]

If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross', you can replace the merge statement with:

# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)   
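
As a quick sanity check of the fallback, the sketch below builds two tiny frames (left and right are made-up names, with rows borrowed from the question) and compares the constant-key merge with how='cross'. It assumes a Pandas version that supports both forms.

```python
import pandas as pd

left = pd.DataFrame({"text": ["adidas men shoes", "prada women shoes"]})
right = pd.DataFrame({"keyword": ["men shoes", "men suits", "women socks"]})

# Emulated cross join: every row gets the same constant key,
# so every left row pairs with every right row.
emulated = (
    left.assign(key=1)
        .merge(right.assign(key=1), on="key")
        .drop("key", axis=1)
)

# Built-in cross join (Pandas >= 1.2.0).
builtin = left.merge(right, how="cross")

# Compare the two results as sorted lists of (text, keyword) pairs.
pairs_emulated = sorted(map(tuple, emulated.values.tolist()))
pairs_builtin = sorted(map(tuple, builtin.values.tolist()))

print(emulated.shape)
print(pairs_emulated == pairs_builtin)
```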

After the cross join, we create a boolean mask that keeps the rows where keyword is found within text, using re.search inside .apply().

We have to use re.search instead of the simple Python substring test, stringA in stringB, found in most of the similar answers on Stack Overflow. That test produces a false match between 'men suits' in keyword and 'women suits' in text, because 'men suits' in 'women suits' returns True.

We wrap the keyword in a pair of word-boundary \b metacharacters (regex pattern: rf"\b{x['keyword']}\b") so that it matches only whole words in text from df1. That is, men suits in df2 does not match women suits in df1, because there is no word boundary between wo and men inside the word women.
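
The difference between the two tests can be seen in isolation with a small sketch. The helper name has_keyword is hypothetical, and wrapping the keyword in re.escape is an extra precaution I am assuming here (it only matters if a keyword contains regex metacharacters); neither appears in the answer's own code.

```python
import re

def has_keyword(keyword, text):
    """True if keyword occurs in text delimited by word boundaries."""
    return re.search(rf"\b{re.escape(keyword)}\b", text) is not None

keyword = "men suits"
for text in ["vakko men suits", "women suits"]:
    plain = keyword in text                 # plain substring test
    whole_word = has_keyword(keyword, text)  # word-boundary regex test
    print(f"{text!r}: in -> {plain}, regex -> {whole_word}")
```

For 'women suits' the substring test is True while the regex test is False, which is the false match described above.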

Result:

print(df_out)


    id_x                      text    xc1  xc2  id_y         keyword  abc1  abc2
0      1          adidas men shoes  52465  220     1       men shoes  1000    11
6      2           vakko men suits  49220  224     2       men suits  2000    12
12     3        burberry men shirt  78248  289     3       men shirt  3000    13
24     5  lcwaikiki men sunglasses  34788  745     5  men sunglasses  5000    15

Here, columns id_x and id_y are the original id columns of df1 and df2 respectively. As noted in the comments, these are just row numbers of the dataframes that you may not care about. We can drop these 2 columns and reset the index to clean up the layout:

df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)

Final outcome:

print(df_out)


                       text    xc1  xc2         keyword  abc1  abc2
0          adidas men shoes  52465  220       men shoes  1000    11
1           vakko men suits  49220  224       men suits  2000    12
2        burberry men shirt  78248  289       men shirt  3000    13
3  lcwaikiki men sunglasses  34788  745  men sunglasses  5000    15
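
At the sizes mentioned in the question, a single cross join would produce 130,000 × 8,000 = 1.04 billion intermediate rows, which may not fit in memory. One possible workaround, sketched below under that assumption, is to cross join df1 against df2 one slice at a time and keep only the matches from each slice. The function name cross_match_chunked, its chunk_size parameter, the precompiled patterns and the use of re.escape are all my own illustrative choices, not part of the answer.

```python
import re
import pandas as pd

def cross_match_chunked(df1, df2, chunk_size=1_000):
    """Cross join df1 with df2 one slice of df1 at a time,
    keeping only whole-word keyword matches."""
    # Compile each keyword's pattern once rather than once per row pair.
    patterns = {
        k: re.compile(rf"\b{re.escape(k)}\b") for k in df2["keyword"].unique()
    }
    parts = []
    for start in range(0, len(df1), chunk_size):
        crossed = df1.iloc[start:start + chunk_size].merge(df2, how="cross")
        mask = [
            patterns[k].search(str(t)) is not None
            for k, t in zip(crossed["keyword"], crossed["text"])
        ]
        parts.append(crossed.loc[mask])
    return pd.concat(parts, ignore_index=True)

# A few rows from the question, trimmed to the columns that matter here.
df1 = pd.DataFrame({
    "text": ["adidas men shoes", "vakko men suits",
             "prada women shoes", "lcwaikiki men sunglasses"],
})
df2 = pd.DataFrame({
    "keyword": ["men shoes", "men suits", "women socks", "men sunglasses"],
    "abc1": [1000, 2000, 4000, 5000],
})

df_out = cross_match_chunked(df1, df2, chunk_size=2)
print(df_out)
```

A larger chunk_size means fewer merges but a bigger intermediate frame per step, so the value would need tuning against the available memory.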
