
PySpark join DataFrames multiple columns dynamically ('or' operator)

I have a scenario where I need to dynamically join two DataFrames. I am writing a helper function and passing the DataFrames as input parameters, like this:

    from pyspark.sql import DataFrame

    def joinDataFrame(first_df, second_df, first_cols, second_cols, join_type) -> DataFrame:
        # Spark AND-s together a list of equality conditions;
        # indexing through each DataFrame avoids ambiguous column references
        conditions = [first_df[f] == second_df[s] for (f, s) in zip(first_cols, second_cols)]
        return first_df.join(second_df, conditions, join_type)

This works fine when all the conditions are combined with 'and', but I also need to support 'or' conditions.

I tried building the condition as a string and passing it to the join via expr(), but I get a ParseException.

I would prefer to build the join condition and pass it as a parameter to this function.
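As an aside on the expr() attempt: a ParseException usually means the string uses Python syntax (`==`, `or`) where Spark SQL syntax (`=`, `OR`) is required, or refers to column names that are ambiguous between the two DataFrames. A minimal sketch of building such a string, assuming both DataFrames are aliased as `a` and `b` (the helper name and aliases are my own):

```python
def build_or_condition(first_cols, second_cols, left="a", right="b"):
    """Build a SQL string like 'a.c1 = b.c1 OR a.c2 = b.c2' for use with expr()."""
    return " OR ".join(
        f"{left}.{f} = {right}.{s}" for f, s in zip(first_cols, second_cols)
    )

# Usage sketch (assumes first_df and second_df exist):
# cond = expr(build_or_condition(first_cols, second_cols))
# first_df.alias("a").join(second_df.alias("b"), cond, join_type)
```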

Reduce using | on zipped equality conditions:

from functools import reduce

join_cond = reduce(
    lambda x, y: x | y,
    (first_df[f] == second_df[s] for (f, s) in zip(first_cols, second_cols)),
)

return_df = first_df.join(second_df, join_cond, join_type)
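The same reduce pattern generalizes to either operator, since PySpark's Column overloads `&` and `|`. A sketch of a combined helper (the function name and the `combine` parameter are my own) covering both the original AND case and the OR case:

```python
from functools import reduce
import operator

def join_dataframes(first_df, second_df, first_cols, second_cols,
                    join_type="inner", combine=operator.and_):
    """Join on pairwise column equality.

    combine=operator.and_ reproduces the default AND behaviour;
    pass combine=operator.or_ to OR the conditions instead.
    """
    cond = reduce(
        combine,
        (first_df[f] == second_df[s] for f, s in zip(first_cols, second_cols)),
    )
    return first_df.join(second_df, cond, join_type)
```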
