Python: combine str.contains and merge in pandas
I have two dataframes that look somewhat like the following (the Content column in df1 actually holds the full content of an article, not just one sentence as in my example):
   PDF   Content
1  1234  This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2  1111  Johannes writes about apples and oranges and that's great.
3  8000  Content that cannot be matched to the anything in df1.
4  3993  There is an interesting piece on bananas plus kiwis as well.
...
(Total: 5709 entries)
   Author    Title
1  Johannes  Apples and oranges
2  Peter     Bananas and pears and grapes
3  Hannah    Bananas plus kiwis
4  Helena    Mangos and peaches
...
(Total: 10228 entries)
I would like to merge both dataframes by searching for the Title from df2 in the Content of df1. If the title appears somewhere in the first 2500 characters of the content, it is a match. Note: it is important that all entries from df1 are preserved. In contrast, I only want to keep the entries from df2 that are matched (i.e. a left join). Note: all Titles are unique values.
Desired output (column sequence doesn't matter):
   Author    Title                         PDF   Content
1  Peter     Bananas and pears and grapes  1234  This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2  Johannes  Apples and oranges            1111  Johannes writes about apples and oranges and that's great.
3  NaN       NaN                           8000  Content that cannot be matched to the anything in df2.
4  Hannah    Bananas plus kiwis            3993  There is an interesting piece on bananas plus kiwis as well.
...
I think I need a combination of pd.merge and str.contains, but I can't figure out how!
Warning: the solution could be slow :).
1. Get the list of titles.
2. Create an index for df1 based on the title list order.
3. Concat df1 and df2 on idx.
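import pandas as pd

lst = [item.lower() for item in df2.Title.tolist()]
end = len(lst)  # next free index past the last row of df2

def func(row):
    global end  # func rebinds the module-level counter below
    content = row[:2500].lower()
    for i, item in enumerate(lst):
        if item in content:
            return i  # position of the first title found
    # No title matched: hand out a fresh index so the row does not collide with df2.
    end += 1
    return end - 1

df1 = df1.assign(idx=df1.Content.apply(func))
res = pd.concat([df1.set_index('idx'), df2], axis=1)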
Output:
   PDF     Content                                            Author    Title
0  1111.0  Johannes writes about apples and oranges and t...  Johannes  Apples and oranges
1  1234.0  This article is about bananas and pears and gr...  Peter     Bananas and pears and grapes
2  3993.0  There is an interesting piece on bananas plus ...  Hannah    Bananas plus kiwis
3  NaN     NaN                                                Helena    Mangos and peaches
4  8000.0  Content that cannot be matched to the anything...  NaN       NaN
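The output above still contains the unmatched df2 row (Helena), whereas the question asks for a left join that keeps only df1 rows. A minimal sketch of one way to get that with the same "index by title position" idea is shown here. It rests on assumptions not in the answer: the helper name title_position and the -1 sentinel are hypothetical, and df2 is given a fresh 0-based index so positions line up with the title list. Like the loop above, it takes the first title in df2 order that occurs in the text.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, "
        "but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Positions 0..n-1 must line up with the title list, so reset df2's index.
df2 = df2.reset_index(drop=True)
titles = [title.lower() for title in df2["Title"]]

def title_position(content, limit=2500):
    """Hypothetical helper: position of the first listed title found, else -1."""
    head = content[:limit].lower()
    for i, title in enumerate(titles):
        if title in head:
            return i
    return -1  # sentinel that matches no row of df2

idx = df1["Content"].map(title_position)
# join(on=...) looks each idx value up in df2's index; the default is a left join.
res = df1.assign(idx=idx).join(df2, on="idx").drop(columns="idx")
print(res[["PDF", "Author", "Title"]])
```

Because the join is a lookup rather than an index alignment, several articles may point at the same title. On the four sample rows, PDF 1234 is paired with "Apples and oranges", since that title is listed first in df2 and the article mentions it too; which title should win in such a case is not stated in the question.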
You could do a full cartesian join / cross product, then filter. Since a substring match cannot use a hash lookup anyway, it shouldn't be any slower than the equivalent "Join" statement:
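import pandas as pd

# Give both frames the same constant key so the merge pairs every df1 row with every df2 row.
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean mask: True where the title occurs in the first 2500 characters.
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]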
Which produces the table:
    PDF     Author    Title                         Content
0   1234.0  Johannes  Apples and oranges            This article is about bananas and pears and gr...
1   1234.0  Peter     Bananas and pears and grapes  This article is about bananas and pears and gr...
4   1111.0  Johannes  Apples and oranges            Johannes writes about apples and oranges and t...
14  3993.0  Hannah    Bananas plus kiwis            There is an interesting piece on bananas plus ...
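The filtered cross product has one row per (article, title) match, so PDF 1234 appears twice and the unmatched PDF 8000 is gone. A rough sketch of how it could be reduced to the left join from the question follows. Several things here are assumptions rather than part of the answer: it uses merge(how="cross") (available from pandas 1.2) instead of the constant key, it assumes PDF is unique in df1, and when an article contains several titles it keeps the one that occurs earliest in the text, a tie-break the question does not specify. The names LIMIT, pairs, pos and matches are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, "
        "but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

LIMIT = 2500  # only the first 2500 characters of an article are searched

# Pair every article with every title.
pairs = df1.merge(df2, how="cross")

# Offset of the title inside the searched prefix, or -1 if it does not occur.
pairs["pos"] = [
    content[:LIMIT].lower().find(title.lower())
    for content, title in zip(pairs["Content"], pairs["Title"])
]

# Keep real matches, then one title per article: the earliest occurrence (assumed rule).
matches = (
    pairs[pairs["pos"] >= 0]
    .sort_values(["PDF", "pos"])
    .drop_duplicates("PDF")
    [["PDF", "Author", "Title"]]
)

# Left join back onto df1 so that unmatched articles survive with NaN.
result = df1.merge(matches, on="PDF", how="left")
print(result[["PDF", "Author", "Title"]])
```

With the sizes given in the question (5709 articles and 10228 titles) the intermediate cross product would have roughly 58 million rows, so processing df1 in chunks may be necessary in practice.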