Join Pandas DataFrames matching by string and substring
I want to merge two DataFrames by partial string match. I have two data frames to combine.
The first, df1, consists of 130,000 rows like this:
id text xc1 xc2
1 adidas men shoes 52465 220
2 vakko men suits 49220 224
3 burberry men shirt 78248 289
4 prada women shoes 45780 789
5 lcwaikiki men sunglasses 34788 745
The second, df2, consists of 8,000 rows like this:
id keyword abc1 abc2
1 men shoes 1000 11
2 men suits 2000 12
3 men shirt 3000 13
4 women socks 4000 14
5 men sunglasses 5000 15
After matching keyword against text, the output should look like this:
id text xc1 xc2 keyword abc1 abc2
1 adidas men shoes 52465 220 men shoes 1000 11
2 vakko men suits 49220 224 men suits 2000 12
3 burberry men shirt 78248 289 men shirt 3000 13
4 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15
Let's start by ordering the keywords longest-first, so that "women suits" is tried before "men suits":
lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)
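To see what the reindexing does, here is a minimal sketch on a small keyword list. The three keywords are hypothetical and chosen only to illustrate the ordering; they are not the real df2.

```python
import pandas as pd

# Hypothetical keyword list, used only to illustrate the ordering
keywords = pd.Series(["men suits", "women suits", "men shoes"])

# Sort the string lengths in descending order, then reuse that index order
order = keywords.str.len().sort_values(ascending=False).index
lkeys = keywords.reindex(order)

# The longest keyword ("women suits") now comes first;
# the relative order of equal-length keywords is not guaranteed
print(lkeys.tolist())
```

Because the longest keyword is tested first, a text containing "women suits" is assigned that keyword before the shorter "men suits" gets a chance to match.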
Now define a matching function; each text value from df1 will be passed as s to find a matching keyword:
def is_match(arr, s):
    # Return the first keyword in arr that occurs in s, or None if none does
    for a in arr:
        if a in s:
            return a
    return None
Now we can extract the keyword from each text in df1 and add it to a new column:
df1['keyword'] = df1['text'].apply(lambda x: is_match(lkeys, x))
We now have everything we need for a standard merge:
pd.merge(df1, df2, on='keyword')
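Put together, the steps can be run end to end on the five sample rows from the question. This is a sketch rather than a definitive implementation: the variable name merged and the suffixes argument are additions for illustration, used to keep the two id columns apart.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": ["adidas men shoes", "vakko men suits", "burberry men shirt",
             "prada women shoes", "lcwaikiki men sunglasses"],
    "xc1": [52465, 49220, 78248, 45780, 34788],
    "xc2": [220, 224, 289, 789, 745],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "keyword": ["men shoes", "men suits", "men shirt",
                "women socks", "men sunglasses"],
    "abc1": [1000, 2000, 3000, 4000, 5000],
    "abc2": [11, 12, 13, 14, 15],
})

# Keywords ordered longest-first
lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)

def is_match(arr, s):
    # Plain substring test: first keyword contained in s, else None
    for a in arr:
        if a in s:
            return a
    return None

df1["keyword"] = df1["text"].apply(lambda x: is_match(lkeys, x))

# suffixes is an illustrative choice to distinguish the two id columns
merged = pd.merge(df1, df2, on="keyword", suffixes=("_df1", "_df2"))
print(merged[["text", "keyword", "abc1", "abc2"]])
```

Note that on this sample the plain substring test also pairs "prada women shoes" with "men shoes", because "men shoes" is a substring of "women shoes". The merge therefore returns five rows rather than the four in the desired output.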
Another approach is to cross join the two DataFrames and then keep only the rows where the keyword matches a substring of the text, as follows:
df3 = df1.merge(df2, how='cross') # for Pandas version >= 1.2.0 (released in Dec 2020)
import re
mask = df3.apply(lambda x: re.search(rf"\b{x['keyword']}\b", str(x['text'])) is not None, axis=1)
df_out = df3.loc[mask]
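A self-contained version of the cross join and filter, on two rows per frame, might look like the sketch below. The helper name keyword_in_text and the use of re.escape are additions not present in the original snippet; re.escape is a precaution in case a keyword contains regex metacharacters such as + or (.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "text": ["vakko men suits", "prada women shoes"]})
df2 = pd.DataFrame({"id": [1, 2], "keyword": ["men suits", "men shoes"]})

# Every row of df1 paired with every row of df2 (needs pandas >= 1.2.0)
df3 = df1.merge(df2, how="cross")

def keyword_in_text(row):
    # Hypothetical helper: whole-word search of the keyword inside the text.
    # re.escape is an added safeguard, not part of the original answer.
    pattern = rf"\b{re.escape(row['keyword'])}\b"
    return re.search(pattern, str(row["text"])) is not None

mask = df3.apply(keyword_in_text, axis=1)
df_out = df3.loc[mask]
print(df_out)
```

Only the pair ("vakko men suits", "men suits") survives the filter; "prada women shoes" matches neither keyword as a whole word.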
If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross', you can replace the merge statement with:
# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
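The constant-key trick can be checked on a toy input. The sketch below assumes that neither frame already has a column named key; the two small frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: 3 rows and 2 rows
df1 = pd.DataFrame({"text": ["a", "b", "c"]})
df2 = pd.DataFrame({"keyword": ["x", "y"]})

# Joining on a constant key pairs every row of df1 with every row of df2
df3 = df1.assign(key=1).merge(df2.assign(key=1), on="key").drop("key", axis=1)

print(df3.shape)  # (6, 2): 3 * 2 rows, columns text and keyword
```

The row count of a cross join is the product of the two inputs. For the sizes in the question, 130,000 × 8,000 gives 1.04 billion rows, so it may be worth processing df1 in chunks if memory is tight.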
After the cross join, we create a boolean mask that keeps only the rows where keyword is found within text, by using re.search within .apply().
We have to use re.search instead of a simple Python substring test like stringA in stringB, which is found in most of the similar answers on StackOverflow. Such a test produces a false match between 'men suits' in keyword and 'women suits' in text, since 'men suits' in 'women suits' returns True.
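The difference between the two tests can be checked directly on a single pair of strings. In this sketch the variable names plain and bounded are illustrative, and re.escape is an added precaution.

```python
import re

keyword = "men suits"
text = "prada women suits"

# Plain substring test: matches inside the word "women"
plain = keyword in text

# Whole-word test: \b requires a word boundary on both sides of the keyword
bounded = re.search(rf"\b{re.escape(keyword)}\b", text) is not None

print(plain, bounded)  # True False
```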
We use a regex with a pair of word boundary \b meta-characters around the keyword (regex pattern: rf"\b{x['keyword']}\b") to ensure that only whole words in text from df1 are matched, i.e. men suits in df2 does not match women suits in df1, since the word women has no word boundary between the letters wo and men.
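If the cross join is too large to hold in memory, one alternative is to combine all keywords into a single word-bounded regex and extract the match per row. This is a swapped-in technique, not the method of either answer, shown as a sketch under these assumptions: each text should be paired with at most one keyword (the first match found), and the small sample frames stand in for the real data.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"text": ["adidas men shoes", "prada women shoes", "vakko men suits"]})
df2 = pd.DataFrame({"keyword": ["men shoes", "men suits", "women socks"],
                    "abc1": [1000, 2000, 4000]})

# Longest keywords first, so a longer alternative is tried before a shorter one
keys = sorted(df2["keyword"], key=len, reverse=True)

# One alternation wrapped in word boundaries; re.escape protects metacharacters
pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"

# expand=False returns a Series; texts without a match get NaN
df1["keyword"] = df1["text"].str.extract(pattern, expand=False)

df_out = df1.dropna(subset=["keyword"]).merge(df2, on="keyword")
print(df_out)
```

Unlike the cross join, this keeps only one keyword per text, so it is not a drop-in replacement when a text may legitimately match several keywords.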
Result:
print(df_out)
id_x text xc1 xc2 id_y keyword abc1 abc2
0 1 adidas men shoes 52465 220 1 men shoes 1000 11
6 2 vakko men suits 49220 224 2 men suits 2000 12
12 3 burberry men shirt 78248 289 3 men shirt 3000 13
24 5 lcwaikiki men sunglasses 34788 745 5 men sunglasses 5000 15
Here, the columns id_x and id_y are the original id columns of df1 and df2 respectively. As seen from the comment, these are just row numbers of the dataframes that you may not care about. We can then remove these 2 columns and reset the index to clean up the layout:
df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)
Final outcome:
print(df_out)
text xc1 xc2 keyword abc1 abc2
0 adidas men shoes 52465 220 men shoes 1000 11
1 vakko men suits 49220 224 men suits 2000 12
2 burberry men shirt 78248 289 men shirt 3000 13
3 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15