
How to write a universal function to join two PySpark dataframes?

I want to write a function that performs an inner join on two dataframes and also eliminates the repeated common columns after joining. As far as I'm aware there is no way to do that, since we always need to specify the common columns manually when joining. Or is there a way?

If you need to join on all the common columns, you can extract them into a list and pass it to join(). Passing column names this way keeps only a single copy of each join column in the output; if you also want to remove those columns from the result entirely, call drop on them after the join.

# all column names shared by df and df2
common_cols = list(set(df.columns).intersection(set(df2.columns)))

# inner join on the shared columns, then drop them from the result
df3 = df.join(df2, common_cols, how='inner').drop(*common_cols)
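
To make this reusable, the same logic can be wrapped in a small helper. Below is a minimal sketch, assuming the hypothetical name join_on_common_cols and that removing the shared columns from the result is the desired behaviour:

from pyspark.sql import DataFrame

def join_on_common_cols(left: DataFrame, right: DataFrame, how: str = 'inner') -> DataFrame:
    # columns present in both dataframes
    common_cols = list(set(left.columns).intersection(right.columns))
    # join on those columns, then remove them from the result
    return left.join(right, common_cols, how=how).drop(*common_cols)

Usage: df3 = join_on_common_cols(df, df2)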

