
How can I match rows in a PySpark dataframe when the value in one column of a row matches the value in a different column of another row?

I have a Spark dataframe like the one below. Whenever the value in col2 of a row also appears in col1 of other rows, I want to collect those rows' col3 values into a list in a new column. And I would rather not use a self-join.

input:

col1    col2    col3  
A       B       1  
B       C       2
B       A       3 

output:

col1    col2    col3    col4
A       B       1       [2,3]  
B       C       2       []
B       A       3       [1] 

You can create a mapping using groupby and then use merge. (Note that this uses the pandas API rather than PySpark; a PySpark sketch follows after the output below.)

# Collect col3 values per col1 into lists, then rename so the result can be merged on col2
mapper = df.groupby('col1', as_index=False).agg({'col3': list}).rename(columns={'col3': 'col4', 'col1': 'col2'})
# Left-merge the mapping back onto the original frame
df.merge(mapper, on='col2', how='left')

Output:

  col1  col2    col3    col4
0   A   B       1      [2, 3]
1   B   C       2      NaN
2   B   A       3      [1]
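Note that rows whose col2 value never appears in col1 get NaN rather than the empty list shown in the desired output. If you need actual empty lists, one option is to replace the NaNs after the merge; out here is just a name I've chosen for the merged result:

out = df.merge(mapper, on='col2', how='left')
out['col4'] = out['col4'].apply(lambda x: x if isinstance(x, list) else [])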

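Since the question asks about PySpark rather than pandas, the same groupby-then-merge idea translates directly to groupBy/collect_list plus a left join. The sketch below is one way to do it under some assumptions (mapper and result are names I chose); it still uses a join, but against the small aggregated frame rather than a full self-join:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", "B", 1), ("B", "C", 2), ("B", "A", 3)],
    ["col1", "col2", "col3"],
)

# Collect col3 values per col1, then rename col1 to col2 so the join key lines up
mapper = (
    df.groupBy("col1")
      .agg(F.collect_list("col3").alias("col4"))
      .withColumnRenamed("col1", "col2")
)

# Left join against the small aggregated frame; coalesce turns the nulls
# from unmatched rows into empty arrays, matching the desired output
result = (
    df.join(mapper, on="col2", how="left")
      .withColumn("col4", F.coalesce("col4", F.array().cast("array<bigint>")))
      .select("col1", "col2", "col3", "col4")
)
result.show()

This yields [2, 3], [], and [1] in col4 for the three rows, matching the desired output above.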