Join two pandas DataFrames based on a list column

Question

I have 2 DataFrames that each contain a column of lists.
I want to join them on rows whose lists share 2 or more values. Example:

ColumnA ColumnB        | ColumnA ColumnB        
id1     ['a','b','c']  | id3     ['a','b','c','x','y', 'z']
id2     ['a','d','e']  |

In this case, id1 matches id3 because their lists share 2 or more values. So the output would be (the column names are not important, they are only an example):

    ColumnA1 ColumnB1       ColumnA2 ColumnB2
    id1      ['a','b','c']  id3      ['a','b','c','x','y', 'z']
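
The matching rule in this example can be written as a set intersection. The following is only a minimal sketch of that rule on plain lists; the helper name `shared_count` is made up for illustration and does not come from the post.

```python
def shared_count(left, right):
    """Return the number of distinct values present in both lists."""
    return len(set(left) & set(right))

id1 = ['a', 'b', 'c']
id2 = ['a', 'd', 'e']
id3 = ['a', 'b', 'c', 'x', 'y', 'z']

# id1 and id3 share a, b and c, so they satisfy the "2 or more" rule
print(shared_count(id1, id3) >= 2)
# id2 and id3 share only a, so they do not
print(shared_count(id2, id3) >= 2)
```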

How can I achieve this result? I tried iterating over every row of DataFrame #1, but that does not seem like a good idea.
Thanks!

Answer 1

Take the Cartesian product of the rows and check each resulting row.

The code is documented inline.

import pandas as pd

df1 = pd.DataFrame(
    {
        'ColumnA': ['id1', 'id2'],
        'ColumnB': [['a','b','c'], ['a','d','e']],
    }
)

df2 = pd.DataFrame(
    {
        'ColumnA': ['id3'],
        'ColumnB': [['a','b','c','x','y', 'z']],
    }
)

# Take the Cartesian product of both DataFrames via a constant helper key
df1['k'] = 0
df2['k'] = 0
df = pd.merge(df1, df2, on='k').drop(columns='k')
# Count the distinct values shared by the two lists in each row
df['overlap'] = df.apply(lambda x: len(set(x['ColumnB_x']).intersection(
                                   set(x['ColumnB_y']))), axis=1)
# Keep the rows that share at least 2 values
df = df[df['overlap'] >= 2]
print(df)

Output:

  ColumnA_x  ColumnB_x ColumnA_y           ColumnB_y  overlap
0       id1  [a, b, c]       id3  [a, b, c, x, y, z]        3
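
The Cartesian product builds len(df1) × len(df2) rows before any filtering. As a possible alternative, the sketch below explodes each list column and merges on the individual values, so only pairs that share at least one value are ever materialized. This is not the method of the answer above but a swapped-in technique; it assumes the ids in ColumnA are unique within each frame, and the helper name `exploded` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ColumnA': ['id1', 'id2'],
                    'ColumnB': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA': ['id3'],
                    'ColumnB': [['a', 'b', 'c', 'x', 'y', 'z']]})

def exploded(df):
    # One row per (id, distinct list element); duplicates inside a list collapse
    return df.explode('ColumnB').drop_duplicates()

# Every row of `pairs` is one value shared by an id from df1 and an id from df2
pairs = exploded(df1).merge(exploded(df2), on='ColumnB', suffixes=('_x', '_y'))

# Count the shared values per id pair and keep pairs with at least 2
counts = (pairs.groupby(['ColumnA_x', 'ColumnA_y'])
               .size()
               .reset_index(name='overlap'))
matches = counts[counts['overlap'] >= 2]
print(matches)
```

The original list columns can be brought back by merging `matches` with `df1` and `df2` on the id columns. Whether this is faster than the cross join depends on how many values the lists have in common, so it is worth timing on real data.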

Answer 2

If you are using pandas 1.2.0 or later (released 26 December 2020), the Cartesian product (cross join) can be simplified as follows:

    df = df1.merge(df2, how='cross')         # simplified cross join for pandas >= 1.2.0
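
As a quick illustration of what the cross join produces, the sketch below checks that the result has one row for every pairing of a row in `df1` with a row in `df2`. It reuses the sample data from this answer; the variable name `pairs` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3', 'id4'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z'],
                                 ['d', 'e', 'f', 'p', 'q', 'r']]})

# how='cross' needs pandas >= 1.2.0 and is used without on=/left_on=/right_on=
pairs = df1.merge(df2, how='cross')

# The row count is the product of the two input row counts
print(len(pairs) == len(df1) * len(df2))
print(pairs[['ColumnA1', 'ColumnA2']])
```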

In addition, if you care about performance (execution time), it is better to use list(map... rather than the slower apply(... axis=1).

Using apply(... axis=1):

%%timeit
df['overlap'] = df.apply(lambda x: 
                         len(set(x['ColumnB1']).intersection(
                             set(x['ColumnB2']))), axis=1)


800 µs ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using list(map(...:

%%timeit
df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))

217 µs ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that list(map... is more than 3 times faster here (217 µs versus 800 µs)!
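
The %%timeit cells above only run in IPython or Jupyter. A rough equivalent for a plain Python script, using the standard `timeit` module, might look like the sketch below. The 1000-row frame and the function names `with_apply` and `with_map` are assumptions made for illustration, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

# Hypothetical test data: 1000 rows built by repeating the sample lists
df = pd.DataFrame({
    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']] * 500,
    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z']] * 1000,
})

def with_apply():
    # Row-wise apply: pandas builds a Series object for every row
    return df.apply(lambda row: len(set(row['ColumnB1']) & set(row['ColumnB2'])),
                    axis=1).tolist()

def with_map():
    # Plain Python iteration over the two columns in parallel
    return list(map(lambda x, y: len(set(x) & set(y)),
                    df['ColumnB1'], df['ColumnB2']))

# Both variants must give the same overlap counts before timing them
assert with_apply() == with_map()

for fn in (with_apply, with_map):
    seconds = timeit.timeit(fn, number=5)
    print(f'{fn.__name__}: {seconds / 5 * 1e3:.2f} ms per call')
```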

The full code for reference:

    import pandas as pd

    data = {'ColumnA1': ['id1', 'id2'], 'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]}
    df1 = pd.DataFrame(data)

    data = {'ColumnA2': ['id3', 'id4'], 'ColumnB2': [['a','b','c','x','y', 'z'], ['d','e','f','p','q', 'r']]}
    df2 = pd.DataFrame(data)

    df = df1.merge(df2, how='cross')             # for pandas version >= 1.2.0

    df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))

    df = df[df['overlap'] >= 2]
    print(df)

Answer 1
2 2021-02-05 09:54:16

Answer 2
0 Accepted 2021-02-05 11:10:58
