Join two pandas dataframes based on lists columns
I have 2 dataframes that contain list columns.
I would like to join them based on 2+ shared values in the lists. Example:
ColumnA ColumnB | ColumnA ColumnB
id1 ['a','b','c'] | id3 ['a','b','c','x','y', 'z']
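id2 ['a','d','e'] |
In this case we can see that id1 matches id3 because their lists share 2+ values. So the output would be (the column names are not important, they are just an example):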
ColumnA1 ColumnB1 ColumnA2 ColumnB2
id1 ['a','b','c'] id3 ['a','b','c','x','y', 'z']
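How can I achieve this result? I tried iterating over each row in dataframe #1, but that doesn't seem like a good idea.
Thank you!
Use the Cartesian product of the rows and check each resulting pair.
The code is documented inline:
import pandas as pd

df1 = pd.DataFrame(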
{
'ColumnA': ['id1', 'id2'],
'ColumnB': [['a','b','c'], ['a','d','e']],
}
)
df2 = pd.DataFrame(
{
'ColumnA': ['id3'],
'ColumnB': [['a','b','c','x','y', 'z']],
}
)
# Take cartesian product of both dataframes
df1['k'] = 0
df2['k'] = 0
df = pd.merge(df1, df2, on='k').drop(columns='k')
# Check the overlap of the lists and find the overlap length
df['overlap'] = df.apply(lambda x: len(set(x['ColumnB_x']).intersection(
set(x['ColumnB_y']))), axis=1)
# Keep the pairs that share at least 2 values
df = df[df['overlap'] >= 2]
print (df)
Output:
ColumnA_x ColumnB_x ColumnA_y ColumnB_y overlap
0 id1 [a, b, c] id3 [a, b, c, x, y, z] 3
If you are using pandas 1.2.0 or newer (released December 26, 2020), the Cartesian product (cross join) can be simplified as follows:
df = df1.merge(df2, how='cross') # simplified cross join for pandas >= 1.2.0
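To make it clearer what the cross join produces, here is a minimal sketch on two tiny frames. The names `left`, `right`, `cross` and `pairs`, and the extra ids `id4` and `id5`, are made up for illustration; it assumes pandas >= 1.2.0.

```python
import pandas as pd

# Hypothetical toy frames, only used to show the shape of a cross join
left = pd.DataFrame({'ColumnA1': ['id1', 'id2']})
right = pd.DataFrame({'ColumnA2': ['id3', 'id4', 'id5']})

# how='cross' pairs every row of left with every row of right (pandas >= 1.2.0)
cross = left.merge(right, how='cross')

# Collect the (left id, right id) combinations that were generated
pairs = list(zip(cross['ColumnA1'], cross['ColumnA2']))
print(cross.shape)  # (6, 2): 2 rows x 3 rows, one column from each frame
```

Keep in mind that the result has len(left) * len(right) rows, so this can get large quickly for big dataframes.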
In addition, if you care about performance (execution time), it is better to use list(map(...)) instead of the slower apply(..., axis=1).
Using apply(..., axis=1):
%%timeit
df['overlap'] = df.apply(lambda x:
len(set(x['ColumnB1']).intersection(
set(x['ColumnB2']))), axis=1)
800 µs ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using list(map(...)):
%%timeit
df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))
217 µs ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that list(map(...)) is more than 3 times faster in this measurement!
The complete code for your reference:
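The %%timeit cells above only work in IPython/Jupyter. If you want to repeat the comparison in a plain script, something like the following sketch with the standard timeit module should do. The function names `overlap_apply` and `overlap_map` are my own, and the absolute numbers will depend on your machine and pandas version, so no timings are quoted here.

```python
import timeit

import pandas as pd

df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z']]})
df = df1.merge(df2, how='cross')  # pandas >= 1.2.0


def overlap_apply():
    # Row-wise apply: builds a Series object for every row, which is slow
    return df.apply(
        lambda row: len(set(row['ColumnB1']) & set(row['ColumnB2'])),
        axis=1,
    ).tolist()


def overlap_map():
    # Walks the two columns in parallel without creating row objects
    return list(map(lambda x, y: len(set(x) & set(y)),
                    df['ColumnB1'], df['ColumnB2']))


# Both variants must give the same overlap counts
assert overlap_apply() == overlap_map()

runs = 200
for fn in (overlap_apply, overlap_map):
    seconds = timeit.timeit(fn, number=runs)
    print(f'{fn.__name__}: {seconds / runs * 1e6:.1f} µs per call')
```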
import pandas as pd

data = {'ColumnA1': ['id1', 'id2'], 'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]}
df1 = pd.DataFrame(data)
data = {'ColumnA2': ['id3', 'id4'], 'ColumnB2': [['a','b','c','x','y', 'z'], ['d','e','f','p','q', 'r']]}
df2 = pd.DataFrame(data)
df = df1.merge(df2, how='cross') # for pandas version >= 1.2.0
df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))
df = df[df['overlap'] >= 2]
print (df)
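If you need this in more than one place, the same steps can be wrapped in a small helper. This is only a sketch: `join_on_shared_values` and its `min_shared` parameter are names I made up, it assumes pandas >= 1.2.0, and it assumes the two list columns have different names (otherwise merge would add suffixes to them).

```python
import pandas as pd


def join_on_shared_values(left, right, left_col, right_col, min_shared=2):
    """Cross join two frames and keep pairs whose lists share >= min_shared values.

    Hypothetical helper; left_col and right_col must be distinct column names.
    """
    pairs = left.merge(right, how='cross')  # pandas >= 1.2.0
    # Count the distinct values that the two lists have in common
    pairs['overlap'] = [
        len(set(x) & set(y))
        for x, y in zip(pairs[left_col], pairs[right_col])
    ]
    return pairs[pairs['overlap'] >= min_shared].reset_index(drop=True)


df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3', 'id4'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z'],
                                 ['d', 'e', 'f', 'p', 'q', 'r']]})

result = join_on_shared_values(df1, df2, 'ColumnB1', 'ColumnB2')
print(result[['ColumnA1', 'ColumnA2', 'overlap']])
```

With the sample data, id1 pairs with id3 (3 shared values) and id2 pairs with id4 (2 shared values, 'd' and 'e').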