
How to write a universal function to join two PySpark dataframes?

I want to write a function that performs an inner join on two dataframes and also eliminates the repeated common columns after joining. As far as I'm aware there is no way to do that, since we always need to specify the common columns manually when joining. Or is there a way?

If you need to join on all the common columns, you can extract them into a list and pass it to join(). Passing column names this way keeps only a single copy of each join column in the output; if you also want to remove those columns from the result entirely, call drop on them after the join.

# all column names shared by df and df2
common_cols = list(set(df.columns).intersection(set(df2.columns)))

# inner join on the shared columns, then drop them from the result
df3 = df.join(df2, common_cols, how='inner').drop(*common_cols)
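
To make this reusable, the same logic can be wrapped in a small helper. Below is a minimal sketch, assuming the hypothetical name join_on_common_cols and that removing the shared columns from the result is the desired behaviour:

from pyspark.sql import DataFrame

def join_on_common_cols(left: DataFrame, right: DataFrame, how: str = 'inner') -> DataFrame:
    # columns present in both dataframes
    common_cols = list(set(left.columns).intersection(right.columns))
    # join on those columns, then remove them from the result
    return left.join(right, common_cols, how=how).drop(*common_cols)

Usage: df3 = join_on_common_cols(df, df2)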

