I have a Spark DataFrame like the one below. If the value in col2 appears in col1 of other rows, I want to collect the col3 values of those rows into a list in a new column. I would rather not use a self-join.
input:
col1 col2 col3
A B 1
B C 2
B A 3
output:
col1 col2 col3 col4
A B 1 [2,3]
B C 2 []
B A 3 [1]
You need to create a mapping using groupby and then use merge. Note that the code below is pandas, not Spark.
mapper = df.groupby('col1', as_index=False).agg({'col3': list}).rename(columns={'col3':'col4', 'col1': 'col2'})
df.merge(mapper, on='col2', how='left')
Output:
col1 col2 col3 col4
0 A B 1 [2, 3]
1 B C 2 NaN
2 B A 3 [1]
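The merged result above has NaN where the question asked for an empty list. The following is a minimal sketch of one way to close that gap in pandas. It assumes the sample data from the question and that col4 only ever holds lists or NaN; the name `result` is introduced here for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "col1": ["A", "B", "B"],
    "col2": ["B", "C", "A"],
    "col3": [1, 2, 3],
})

# Collect col3 into a list per col1 value, then rename so the
# grouping key lines up with col2 for the lookup.
mapper = (
    df.groupby("col1", as_index=False)
      .agg({"col3": list})
      .rename(columns={"col3": "col4", "col1": "col2"})
)

# A left merge keeps every original row; unmatched rows get NaN in col4.
result = df.merge(mapper, on="col2", how="left")

# Replace the NaN placeholders with empty lists (assumes col4 holds
# only lists or NaN).
result["col4"] = result["col4"].apply(lambda v: v if isinstance(v, list) else [])
print(result)
```

An `apply` is used here because `fillna` does not accept a list as the fill value. This is still a join against an aggregated copy of the frame, so it may not satisfy the "no self-join" preference in the question.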