Find the first intersection in each row of a pandas DataFrame

Question

I have a dataframe:

import pandas as pd
data =[[28, ['first'], 'apple edible', 23, 'apple is an edible fruit'],
 [28, ['first'], 'apple edible', 34, 'fruit produced by an apple tree'],
 [28, ['first'], 'apple edible', 39, 'the apple is a pome edible fruit'],
 [21, ['second'], 'green plants', 11, 'plants are green'],
 [21, ['second'], 'green plants', 7, 'plant these perennial green flowers']]
df = pd.DataFrame(data, columns=['day', 'group',  'bigram', 'count', 'sentence'])
+---+--------+------------+-----+-----------------------------------+
|day|group   |bigram      |count|sentence                           |
+---+--------+------------+-----+-----------------------------------+
|28 |[first] |apple edible|23   |apple is an edible fruit           |
|28 |[first] |apple edible|34   |fruit produced by an apple tree    |
|28 |[first] |apple edible|39   |the apple is a pome edible fruit   |
|21 |[second]|green plants|11   |plants are green                   |
|21 |[second]|green plants|7    |plant these perennial green flowers|
+---+--------+------------+-----+-----------------------------------+

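I need to find the intersection of the bigram with the sentence. Then I want to find the first intersection and mark it True; every intersection after the first one should be marked False. Word order does not matter.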

So I want this result:

+---+--------+------------+-----+-----------------------------------+--------+
|day|group   |bigram      |count|sentence                           |        |
+---+--------+------------+-----+-----------------------------------+--------+
|28 |[first] |apple edible|23   |apple is an edible fruit           |True    |
|28 |[first] |apple edible|34   |fruit produced by an apple tree    |False   |
|28 |[first] |apple edible|39   |the apple is a pome edible fruit   |False   |
|21 |[second]|green plants|11   |plants are green                   |True    |
|21 |[second]|green plants|7    |plant these perennial green flowers|False   |
+---+--------+------------+-----+-----------------------------------+--------+

Answer 1

First test every row for an intersection by converting the split bigram to a set and calling issubset, then keep only the first True per bigram:

df['new'] = [set(b.split()).issubset(a.split()) for a,b in zip(df['sentence'],df['bigram'])]
df['new'] = ~df.duplicated(['bigram','new']) & df['new']
print (df)
   day     group        bigram  count                             sentence  \
0   28   [first]  apple edible     23             apple is an edible fruit   
1   28   [first]  apple edible     34      fruit produced by an apple tree   
2   28   [first]  apple edible     39     the apple is a pome edible fruit   
3   21  [second]  green plants     11                     plants are green   
4   21  [second]  green plants      7  plant these perennial green flowers   

     new  
0   True  
1  False  
2  False  
3   True  
4  False
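
To see why the second line keeps only the first match, it can help to look at the intermediate masks. The following is a minimal sketch that rebuilds the question's data; the names `mark_first_match`, `is_match`, and `first_of_pair` are illustrative and do not appear in the answer.

```python
import pandas as pd

def mark_first_match(df: pd.DataFrame) -> pd.Series:
    """Return True only on the first row per bigram whose sentence contains every bigram word."""
    # Step 1: row-wise subset test; word order is ignored because sets are unordered.
    is_match = pd.Series(
        [set(b.split()).issubset(s.split())
         for s, b in zip(df["sentence"], df["bigram"])],
        index=df.index,
    )
    # Step 2: duplicated() is False for the first occurrence of each
    # (bigram, is_match) pair, so its negation flags the first matching row
    # and the first non-matching row of every bigram.
    first_of_pair = ~df.assign(is_match=is_match).duplicated(["bigram", "is_match"])
    # Step 3: AND-ing with is_match drops the first non-matching rows again.
    return first_of_pair & is_match

data = [[28, ['first'], 'apple edible', 23, 'apple is an edible fruit'],
        [28, ['first'], 'apple edible', 34, 'fruit produced by an apple tree'],
        [28, ['first'], 'apple edible', 39, 'the apple is a pome edible fruit'],
        [21, ['second'], 'green plants', 11, 'plants are green'],
        [21, ['second'], 'green plants', 7, 'plant these perennial green flowers']]
df = pd.DataFrame(data, columns=['day', 'group', 'bigram', 'count', 'sentence'])

result = mark_first_match(df)
print(result.tolist())  # [True, False, False, True, False]
```

Row 2 also contains both words of its bigram, but it shares the pair (`apple edible`, True) with row 0, so `duplicated` marks it and the negation turns it False.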

If the word order within the bigram can be swapped between rows and you still need only the first intersection, use:

df['new'] = ~df.assign(bigram=df['bigram'].apply(lambda x: frozenset(x.split()))).duplicated(['bigram','new']) & df['new']
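
The `frozenset` key only changes the result when the same two words appear in different orders across rows, for example `apple edible` and `edible apple`. A small sketch with made-up rows (not from the question) that compares the two keys; the names `match`, `by_string`, and `by_words` are illustrative.

```python
import pandas as pd

# Hypothetical rows: the same two words, written in both orders.
df = pd.DataFrame({
    "bigram": ["apple edible", "edible apple", "edible apple"],
    "sentence": ["apple is an edible fruit",
                 "the apple is a pome edible fruit",
                 "fruit produced by an apple tree"],
})

# Row-wise subset test, as in the answer.
match = pd.Series(
    [set(b.split()).issubset(s.split())
     for s, b in zip(df["sentence"], df["bigram"])],
    index=df.index,
)

# Keyed on the raw string: "apple edible" and "edible apple" count as
# different bigrams, so each one gets its own first match.
by_string = ~df.assign(new=match).duplicated(["bigram", "new"]) & match

# Keyed on a frozenset of the words: both spellings collapse to one key,
# so only the very first match survives.
key = df["bigram"].apply(lambda x: frozenset(x.split()))
by_words = ~df.assign(bigram=key, new=match).duplicated(["bigram", "new"]) & match

print(by_string.tolist())  # [True, True, False]
print(by_words.tolist())   # [True, False, False]
```

A `frozenset` is used rather than a plain `set` because `duplicated` needs hashable values.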

Answer 2

You can do this in two steps: first identify the rows where the bigram is a subset of the sentence (using issubset), then keep only the first True:

# use python sets to identify the matching bigrams
df['intersection'] = [set(a.split()).issubset(b.split())
                      for a,b in zip(df['bigram'], df['sentence'])]

# select the non-first matches and replace with False
df.loc[~df.index.isin(df.groupby(df['group'].str[0])['intersection'].idxmax()),
       'intersection'] = False

Output:

   day     group        bigram  count                             sentence  intersection
0   28   [first]  apple edible     23             apple is an edible fruit          True
1   28   [first]  apple edible     34      fruit produced by an apple tree         False
2   28   [first]  apple edible     39     the apple is a pome edible fruit         False
3   21  [second]  green plants     11                     plants are green          True
4   21  [second]  green plants      7  plant these perennial green flowers         False
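
An alternative sketch, not from the answer, gets the same flags without an index lookup by keeping a running count of matches inside each group: the first match is the matching row where that count equals 1. The grouping key `df['group'].str[0]` follows the answer; the names `match`, `running`, and `first_match` are illustrative.

```python
import pandas as pd

data = [[28, ['first'], 'apple edible', 23, 'apple is an edible fruit'],
        [28, ['first'], 'apple edible', 34, 'fruit produced by an apple tree'],
        [28, ['first'], 'apple edible', 39, 'the apple is a pome edible fruit'],
        [21, ['second'], 'green plants', 11, 'plants are green'],
        [21, ['second'], 'green plants', 7, 'plant these perennial green flowers']]
df = pd.DataFrame(data, columns=['day', 'group', 'bigram', 'count', 'sentence'])

# Row-wise subset test, as in the answer.
match = pd.Series(
    [set(a.split()).issubset(b.split())
     for a, b in zip(df['bigram'], df['sentence'])],
    index=df.index,
)

# The group column holds one-element lists; .str[0] extracts the label.
key = df['group'].str[0]

# Running number of matches seen so far within each group.
running = match.astype(int).groupby(key).cumsum()

# A row is the first match when it matches and the running count is exactly 1.
first_match = match & running.eq(1)
print(first_match.tolist())  # [True, False, False, True, False]
```

A group without any match simply stays all False here, since its running count never reaches 1.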

Answer 1: 2, accepted, 2022-06-20 09:08:15

Answer 2: 1, 2022-06-20 09:08:32
