
Python: combine str.contains and merge in pandas

I have two dataframes that look somewhat like the following (the Content column in df1 actually holds the full text of an article, not just one sentence as in my example):

    PDF     Content
1   1234    This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2   1111    Johannes writes about apples and oranges and that's great.
3   8000    Content that cannot be matched to the anything in df2.
4   3993    There is an interesting piece on bananas plus kiwis as well.
    ...

(Total: 5709 entries)

    Author        Title
1   Johannes      Apples and oranges
2   Peter         Bananas and pears and grapes
3   Hannah        Bananas plus kiwis
4   Helena        Mangos and peaches
    ...

(Total: 10228 entries)

I would like to merge both dataframes by searching for the Title from df2 in the Content of df1. If the title appears somewhere in the first 2500 characters of the content, it is a match. Note: it is important that all entries from df1 are preserved. In contrast, I only want to keep the entries from df2 that are matched (i.e. a left join). Note: all Titles are unique values.

Desired output (column sequence doesn't matter):

    Author     Title                        PDF     Content
1   Peter      Bananas and pears and grapes 1234    This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2   Johannes   Apples and oranges           1111    Johannes writes about apples and oranges and that's great.
3   NaN        NaN                          8000    Content that cannot be matched to the anything in df2.    
4   Hannah     Bananas plus kiwis           3993    There is an interesting piece on bananas plus kiwis as well.
    ...

I think I need a combination of pd.merge and str.contains, but I can't figure out how!

Warning: this solution could be slow :).
1. Get the list of titles.
2. Create an index for df1 based on the title list order.
3. Concat df1 and df2 on idx.

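  import pandas as pd
  from itertools import count

  # Lower-cased titles; a title's list position is its row position in df2.
  lst = [item.lower() for item in df2.Title.tolist()]
  # Articles without a match get fresh index values past the end of df2.
  unmatched = count(len(lst))

  def func(content):
      head = content[:2500].lower()
      for i, item in enumerate(lst):
          if item in head:
              return i  # the first title found wins
      return next(unmatched)

  df1 = df1.assign(idx=df1.Content.apply(func))

  # Align on idx; assumes df2 has the default 0..n-1 index and that no two
  # articles match the same title (duplicate idx values would break concat).
  res = pd.concat([df1.set_index('idx'), df2], axis=1)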

Output:

      PDF                                            Content    Author  \
0  1111.0  Johannes writes about apples and oranges and t...  Johannes
1  1234.0  This article is about bananas and pears and gr...     Peter
2  3993.0  There is an interesting piece on bananas plus ...    Hannah
3     NaN                                                NaN    Helena
4  8000.0  Content that cannot be matched to the anything...       NaN

                          Title
0            Apples and oranges
1  Bananas and pears and grapes
2            Bananas plus kiwis
3            Mangos and peaches
4                           NaN
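
The result above still contains unmatched titles such as Mangos and peaches, whereas the question asks for a left join that keeps every df1 row. The following is only a sketch of one way to get there: look up a title per article, then merge with how="left". The helper `find_title` is hypothetical, and choosing the title that occurs earliest in the text when several match is an assumption of this sketch, not something the question specifies. The sample frames are rebuilt from the question so the block runs on its own.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})


def find_title(content, titles, limit=2500):
    """Hypothetical helper: return the title that occurs earliest in the
    first `limit` characters of `content` (case-insensitive), or None."""
    head = content[:limit].lower()
    hits = [(head.find(title.lower()), title) for title in titles]
    hits = [(pos, title) for pos, title in hits if pos >= 0]
    return min(hits)[1] if hits else None


titles = df2["Title"].tolist()
# Attach the matched title (or None) to every article.
matched = df1.assign(Title=df1["Content"].apply(lambda c: find_title(c, titles)))
# Left join: every df1 row is kept, unmatched rows get NaN for Author.
result = matched.merge(df2, on="Title", how="left")
print(result[["Author", "Title", "PDF"]])
```

Like the loop above, this scans every title for every article, so it is no faster; it only changes which rows end up in the result.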

You could do a full Cartesian join (cross product), then filter. Since a substring match cannot use a hash lookup anyway, it shouldn't be any slower than the equivalent "join" statement:

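import pandas as pd

# Give both frames the same constant key so the merge pairs every row with every row.
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean flag: does the title occur in the first 2500 characters?
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]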

Which produces the table:

       PDF    Author                         Title  \
0   1234.0  Johannes            Apples and oranges
1   1234.0     Peter  Bananas and pears and grapes
4   1111.0  Johannes            Apples and oranges
14  3993.0    Hannah            Bananas plus kiwis

                                              Content
0   This article is about bananas and pears and gr...
1   This article is about bananas and pears and gr...
4   Johannes writes about apples and oranges and t...
14  There is an interesting piece on bananas plus ...
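
Note that the filtered table drops the unmatched article (PDF 8000) and lists PDF 1234 twice, because two titles occur in its text. As a hedged sketch, the unmatched rows can be recovered by left-merging the matches back onto df1. This assumes PDF uniquely identifies a df1 row (not stated in the question) and uses `how="cross"`, which requires pandas 1.2 or later. The sample frames are rebuilt from the question so the block runs on its own.

```python
import pandas as pd

# Sample data rebuilt from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Pair every article with every title (pandas >= 1.2).
pairs = df1.merge(df2, how="cross")

# Keep the pairs whose title occurs in the first 2500 characters.
is_match = [
    title.lower() in content[:2500].lower()
    for title, content in zip(pairs["Title"], pairs["Content"])
]
matches = pairs.loc[is_match, ["PDF", "Author", "Title"]]

# Left-merge back onto df1 so articles without a match survive with NaN.
# Assumption: PDF is a unique identifier for df1 rows.
result = df1.merge(matches, on="PDF", how="left")
print(result[["PDF", "Author", "Title"]])
```

An article that matches several titles still appears once per match here; choosing a single title per article needs an extra rule, which the question does not define.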
