
Pyspark join with mixed conditions

I have two dataframes, left_df and right_df, with common columns to join on: ['col_1', 'col_2'], and I want to join on an additional condition: right_df.col_3.between(left_df.col_4, left_df.col_5)

Code:

from pyspark.sql import functions as F

join_condition = ['col_1', 
                  'col_2', 
                  right_df.col_3.between(left_df.col_4, left_df.col_5)]
df = left_df.join(right_df, on=join_condition, how='left')

df.write.parquet('/tmp/my_df')

But I got the error below:

TypeError: Column is not iterable

Why can't I combine these three conditions?

You cannot mix strings with Columns: the `on` argument must be either a list of strings or a list of Columns, not a mixture of both. Convert the first two items to column expressions instead, e.g.

from pyspark.sql import functions as F

join_condition = [left_df.col_1 == right_df.col_1, 
                  left_df.col_2 == right_df.col_2, 
                  right_df.col_3.between(left_df.col_4, left_df.col_5)]

df = left_df.join(right_df, on=join_condition, how='left')
