
Pyspark join with mixed conditions

I have two dataframes, left_df and right_df, with common columns to join on: ['col_1', 'col_2'], and I want to join on an additional condition: right_df.col_3.between(left_df.col_4, left_df.col_5)

Code:

from pyspark.sql import functions as F

join_condition = ['col_1', 
                  'col_2', 
                  right_df.col_3.between(left_df.col_4, left_df.col_5)]
df = left_df.join(right_df, on=join_condition, how='left')

df.write.parquet('/tmp/my_df')

But I got the error below:

TypeError: Column is not iterable

Why can't I combine these three conditions?

You cannot mix strings with Columns: the `on` argument must be either a list of strings or a list of Columns, not a mixture of both. Convert the first two items to column expressions instead, e.g.

from pyspark.sql import functions as F

join_condition = [left_df.col_1 == right_df.col_1, 
                  left_df.col_2 == right_df.col_2, 
                  right_df.col_3.between(left_df.col_4, left_df.col_5)]

df = left_df.join(right_df, on=join_condition, how='left')
