Pandas: How to find the overlap between two list columns in a dataframe?

Question

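I have a dataframe with two columns, each of which contains lists. I want to determine the overlap between the lists in the two columns, row by row.

For example: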

import pandas as pd

df = pd.DataFrame({'one':[['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
                   'two':[['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']]})

        one         two
    0   [a, b, c]   [b, c, d]
    1   [d, e, f]   [f, g, h]
    2   [h, i, j]   [l, m, n]

Ultimately, I would like it to look like this:

        one         two             overlap
    0   [a, b, c]   [b, c, d]       [b, c]
    1   [d, e, f]   [f, g, h]       [f]
    2   [h, i, j]   [l, m, n]       []

Answer 1

There is no efficient vectorized way to do this; the fastest approach is a list comprehension with set intersection:

df['overlap'] = [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])]

Output:

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []
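
One thing to keep in mind with the set-based comprehension is that a Python set does not guarantee element order, so the overlap may not always come back as [b, c]. The following is a minimal sketch of an order-preserving variant; the helper name ordered_overlap is hypothetical and not part of the original answer, and it keeps the order (and any duplicates) of the first list.

```python
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
})

def ordered_overlap(a, b):
    """Return the items of `a` that also appear in `b`, in the order they occur in `a`."""
    lookup = set(b)  # build the set once so each membership test is cheap
    return [item for item in a if item in lookup]

# Same row-by-row zip as the answer above, with the helper swapped in.
df['overlap'] = [ordered_overlap(a, b) for a, b in zip(df['one'], df['two'])]
print(df)
```

If order does not matter, the plain set intersection above is shorter and does the same job.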

Answer 2

Using `pandas`

The pandas way of doing this could look like this:

f = lambda row: list(set(row['one']).intersection(row['two']))
df['overlap'] = df.apply(f, axis=1)
print(df)

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []

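apply runs the function row by row (axis=1) and finds the set.intersection() between the lists in columns one and two. It then returns the result as a list.

apply is not the fastest method, but in my opinion it is very readable. Since your question does not mention speed as a criterion, that should not be a problem.

Also, you can use either of these two expressions as the lambda function, since both perform the same task: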

#Option 1:
f = lambda x: list(set(x['one']) & set(x['two']))

#Option 2:
f = lambda x: list(set(x['one']).intersection(x['two']))

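Using `Numpy`

You can use the numpy function np.intersect1d together with map over the two Series. Note that np.intersect1d returns a sorted NumPy array of unique values rather than a Python list.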

import numpy as np
import pandas as pd

df['overlap'] = pd.Series(map(np.intersect1d, df['one'], df['two']))
print(df)

         one        two overlap
0  [a, b, c]  [b, c, d]  [b, c]
1  [d, e, f]  [f, g, h]     [f]
2  [h, i, j]  [l, m, n]      []
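
The pd.Series(map(...)) form above creates a new Series with a default 0..n-1 index, and assigning a Series to a column aligns on index labels, so it relies on df also having the default index. As a hedged sketch (the non-default index values below are made up for illustration), assigning a plain list avoids that alignment and .tolist() turns each result into an ordinary Python list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
        'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
    },
    index=[10, 20, 30],  # hypothetical non-default index, for illustration only
)

# A plain list is assigned by position, so no index alignment takes place.
# np.intersect1d returns a sorted array of unique values; .tolist() converts it.
df['overlap'] = [np.intersect1d(a, b).tolist() for a, b in zip(df['one'], df['two'])]
print(df)
```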

Benchmarks

Adding some benchmarks for reference:

%timeit [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])]        #list comprehension
%timeit df.apply(lambda x: list(set(x['one']).intersection(x['two'])),1)  #apply 1
%timeit df.apply(lambda x: list(set(x['one']) & set(x['two'])),1)         #apply 2
%timeit pd.Series(map(np.intersect1d, df['one'], df['two']))              #numpy intersect1d

6.99 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
167 µs ± 830 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
166 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
84.1 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
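
The %timeit lines above only work in IPython or Jupyter. To repeat the comparison in a plain script, or on a larger frame, a sketch along these lines may help. It assumes a hypothetical frame named big built by repeating the three example rows 1,000 times; the numbers it prints depend on your machine and are not reproduced here.

```python
import timeit

import numpy as np
import pandas as pd

base = {
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
}
# Hypothetical larger frame: the three example rows repeated 1,000 times.
big = pd.DataFrame({name: rows * 1000 for name, rows in base.items()})

candidates = {
    'list comprehension': lambda: [list(set(a) & set(b))
                                   for a, b in zip(big['one'], big['two'])],
    'apply': lambda: big.apply(lambda x: list(set(x['one']) & set(x['two'])), axis=1),
    'numpy intersect1d': lambda: pd.Series(map(np.intersect1d, big['one'], big['two'])),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats with 5 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f'{name}: {timings[name] * 1e3:.2f} ms per call')
```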

Answer 3

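Here is an approach that uses applymap to convert the lists to sets and set.intersection to find the overlap. Note that DataFrame.applymap is deprecated as of pandas 2.1 in favor of DataFrame.map: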

df.join(df.applymap(set).apply(lambda x: set.intersection(*x),axis=1).map(list).rename('overlap'))
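
Because set.intersection(*x) accepts any number of sets, the same idea extends beyond two columns. Below is a minimal sketch under the assumption of a hypothetical third column named three (not part of the question); it uses a list comprehension instead of applymap so that it does not depend on the pandas version, and it sorts each result for a stable order.

```python
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
    'three': [['c', 'd', 'e'], ['f', 'x', 'y'], ['h', 'l', 'z']],  # hypothetical column
})

cols = ['one', 'two', 'three']

# zip(*columns) yields one tuple of lists per row; each list becomes a set
# and the sets are intersected together.
df['overlap'] = [
    sorted(set.intersection(*map(set, cells)))
    for cells in zip(*(df[c] for c in cols))
]
print(df)
```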
