
PySpark join DataFrames multiple columns dynamically ('or' operator)

I have a scenario where I need to join two DataFrames dynamically. I am creating a helper function and passing the DataFrames in as input parameters, like this:

from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def joinDataFrame(first_df, second_df, first_cols, second_cols, join_type) -> DataFrame:
    # A list of Column conditions is AND-ed together by join()
    return_df = first_df.join(
        second_df,
        [col(f) == col(s) for (f, s) in zip(first_cols, second_cols)],
        join_type,
    )
    return return_df

This works fine if I only have 'and' scenarios, but I also have a requirement to pass 'or' conditions.

I did try building a string containing the condition and passing it via expr(), but I get a ParseException.

I would prefer to build the join condition and pass it as a parameter to this function.

Reduce using | over the zipped equality conditions:

from functools import reduce

# OR-combine the per-column equality conditions into a single Column
join_cond = reduce(
    lambda x, y: x | y,
    (first_df[f] == second_df[s] for (f, s) in zip(first_cols, second_cols)),
)

return_df = first_df.join(second_df, join_cond, join_type)
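This works because PySpark Column objects overload the | and & operators, so reduce simply folds the pairwise conditions into one expression. A minimal plain-Python sketch of the same fold, using operator.or_ and operator.and_ with made-up boolean values in place of the Column expressions (no Spark session needed):

```python
from functools import reduce
from operator import or_, and_

# Hypothetical per-column comparison results, standing in for
# the Column expressions produced by first_df[f] == second_df[s].
conditions = [False, True, False]

# OR-combine: the "row" matches if ANY column pair is equal.
any_match = reduce(or_, conditions)

# AND-combine: the "row" matches only if ALL column pairs are equal.
all_match = reduce(and_, conditions)

print(any_match, all_match)  # → True False
```

Swapping the lambda for operator.and_ in the answer above gives back the original 'and' behaviour, so the same helper can cover both cases with a flag.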
