Pandas: How do I see the overlap between two lists in a dataframe?
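I have a dataframe with two columns, each of which contains lists. I want to determine the overlap between the lists in the two columns.
For example: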
import pandas as pd
df = pd.DataFrame({'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
                   'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']]})
one two
0 [a, b, c] [b, c, d]
1 [d, e, f] [f, g, h]
2 [h, i, j] [l, m, n]
Ultimately, I want it to look like this:
one two overlap
0 [a, b, c] [b, c, d] [b, c]
1 [d, e, f] [f, g, h] [f]
2 [h, i, j] [l, m, n] []
There is no efficient vectorized way to do this. The fastest approach is a list comprehension with set intersection:
df['overlap'] = [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])]
Output:
one two overlap
0 [a, b, c] [b, c, d] [b, c]
1 [d, e, f] [f, g, h] [f]
2 [h, i, j] [l, m, n] []
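One caveat with the set-based version: sets are unordered, so `list(set(a) & set(b))` is not guaranteed to return the items in the order they appear in the original lists. If order matters, a minimal sketch along the following lines may help. The helper name `ordered_overlap` is hypothetical and not part of the original answer; it keeps the order (and any duplicates) of the first list.

```python
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
})

def ordered_overlap(a, b):
    """Hypothetical helper: items of `a` that also occur in `b`, in `a`'s order."""
    b_set = set(b)  # build the set once so each membership test is cheap
    return [x for x in a if x in b_set]

# Same list-comprehension pattern as above, with the helper swapped in
df['overlap'] = [ordered_overlap(a, b) for a, b in zip(df['one'], df['two'])]
print(df)
```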
Pandas
The pandas way of doing this might look something like this:
f = lambda row: list(set(row['one']).intersection(row['two']))
df['overlap'] = df.apply(f,1)
print(df)
one two overlap
0 [a, b, c] [b, c, d] [b, c]
1 [d, e, f] [f, g, h] [f]
2 [h, i, j] [l, m, n] []
The apply function works row by row (axis=1) and finds the set.intersection() between the lists in column one and column two. It then returns the result as a list.
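The apply method is not the fastest, but in my opinion it is very readable. Since your question does not mention speed as a criterion, that should not be a problem.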
Also, you can use either of these two expressions as the lambda function, since both perform the same task:
#Option 1:
f = lambda x: list(set(x['one']) & set(x['two']))
#Option 2:
f = lambda x: list(set(x['one']).intersection(x['two']))
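The two options differ in one small way that is easy to check in plain Python: the `&` operator needs a set on both sides, while the `.intersection()` method accepts any iterable, which is why Option 2 can pass the second list directly. A short sketch, using made-up lists `a` and `b` for illustration:

```python
a = ['a', 'b', 'c']
b = ['b', 'c', 'd']

via_operator = set(a) & set(b)       # both operands must be sets
via_method = set(a).intersection(b)  # the argument can be any iterable

# Both spellings produce the same set
same_result = via_operator == via_method

# Mixing a set and a list with `&` raises TypeError
try:
    set(a) & b
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

print(same_result, operator_accepts_list)
```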
Numpy
You can use the numpy method np.intersect1d on the two Series, together with map.
import numpy as np
import pandas as pd
df['overlap'] = pd.Series(map(np.intersect1d, df['one'], df['two']))
print(df)
one two overlap
0 [a, b, c] [b, c, d] [b, c]
1 [d, e, f] [f, g, h] [f]
2 [h, i, j] [l, m, n] []
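Two details of this approach are worth keeping in mind. First, `np.intersect1d` returns sorted, unique NumPy arrays rather than Python lists. Second, `pd.Series(map(...))` builds a Series with a fresh 0..n-1 index, so assigning it to a dataframe with a different index would align on labels and could produce NaN. A hedged variant that sidesteps both, assuming plain lists are wanted in the new column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
})

# .tolist() converts each NumPy array back into a plain Python list;
# assigning a list (not a Series) avoids index alignment entirely.
df['overlap'] = [np.intersect1d(a, b).tolist()
                 for a, b in zip(df['one'], df['two'])]
print(df)
```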
Adding some benchmarks for reference:
%timeit [list(set(a)&set(b)) for a,b in zip(df['one'], df['two'])] #list comprehension
%timeit df.apply(lambda x: list(set(x['one']).intersection(x['two'])),1) #apply 1
%timeit df.apply(lambda x: list(set(x['one']) & set(x['two'])),1) #apply 2
%timeit pd.Series(map(np.intersect1d, df['one'], df['two'])) #numpy intersect1d
6.99 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
167 µs ± 830 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
166 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
84.1 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
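The timings above are for the three-row example, where fixed overhead dominates. To see how the methods compare on more data, or to run the comparison outside IPython (where `%timeit` is unavailable), a sketch using the standard `timeit` module could look like this. The frame size, the random data, and the names `big`, `candidates`, and `timings` are all assumptions made for illustration; no results are quoted here because they depend on the machine and library versions.

```python
import random
import string
import timeit

import numpy as np
import pandas as pd

random.seed(0)
n_rows = 1_000  # assumed size; adjust to match your data

def random_list():
    # three distinct lowercase letters per cell, like the example data
    return random.sample(string.ascii_lowercase, 3)

big = pd.DataFrame({
    'one': [random_list() for _ in range(n_rows)],
    'two': [random_list() for _ in range(n_rows)],
})

candidates = {
    'list comprehension': lambda: [list(set(a) & set(b))
                                   for a, b in zip(big['one'], big['two'])],
    'apply': lambda: big.apply(
        lambda x: list(set(x['one']) & set(x['two'])), axis=1),
    'np.intersect1d': lambda: pd.Series(
        map(np.intersect1d, big['one'], big['two'])),
}

timings = {}
for name, fn in candidates.items():
    # best of 3 repeats, 3 calls each, reported per call
    timings[name] = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f'{name}: {timings[name] * 1e3:.2f} ms per call')
```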
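Here is an approach that uses applymap to convert the lists to sets and set.intersection to find the overlap (note that in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map):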
df.join(df.applymap(set).apply(lambda x: set.intersection(*x),axis=1).map(list).rename('overlap'))
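The one-liner packs several steps together, so here it is unpacked as a sketch. This version is a variation rather than the answer's exact code: it converts the cells with `Series.map` per column instead of `applymap`, which should behave the same across pandas versions, and it uses `sorted` instead of `list` so the output order is deterministic. The intermediate names `sets`, `overlap`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'one': [['a', 'b', 'c'], ['d', 'e', 'f'], ['h', 'i', 'j']],
    'two': [['b', 'c', 'd'], ['f', 'g', 'h'], ['l', 'm', 'n']],
})

# Step 1: turn every list cell into a set, column by column
sets = df[['one', 'two']].apply(lambda col: col.map(set))

# Step 2: intersect the sets in each row, then convert back to sorted lists
overlap = (sets.apply(lambda row: set.intersection(*row), axis=1)
               .map(sorted)
               .rename('overlap'))

# Step 3: join the named Series onto the original frame as a new column
result = df.join(overlap)
print(result)
```

Because `set.intersection(*row)` takes every set in the row, the same pattern extends to more than two list columns by adding them to the column selection in step 1.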