I have three dataframes as below.
df_prod
Year  ID      Name  brand  Point
2020  20903   Ken   KKK    2000
2019  12890   Matt  MMM     209
2017  346780  Nene  NNN    2000
2020  346780  Nene  NNN    6000

df_miss
Name   brand  point
Holy   HHH      345
Joshi  JJJ      900

df_sale
ID      Name  Year  brand
126789  Holy  2010
346780  Nene  2017  NNN
346780  Nene  2020  NNN
I need to join df_sale conditionally, as follows. If "brand" is NOT NULL, I need to inner-join df_sale with df_prod on Year and Name. If "brand" is NULL, I need to join df_sale with df_miss on Name.
Is it possible to have a when condition during joins in PySpark? I could find some examples in Scala, but I am looking for a PySpark implementation.
Pseudo code logic:
if brand is not null:
    df_sale.join(df_prod, on=['Year', 'Name'], how='inner')  # take 'point' from df_prod
else:  # brand is null
    df_sale.join(df_miss, on=['Name'], how='inner')  # take 'point' from df_miss
Expected output:
ID      Name  Year  brand  point
126789  Holy  2010           345
346780  Nene  2017  NNN     2000
346780  Nene  2020  NNN     2000
Is it possible to do this in PySpark or SQL? Please give some pointers. Thanks.
When you think about IF ... ELSE ... conditions on DataFrames (or, for that matter, SQL tables), please be aware that the condition has to apply to the table as if you were traversing it row by row. This leaves you with two options (note that f denotes pyspark.sql.functions):

1. Split the df_sale table in two - df_sale_brand_null and df_sale_brand - based on the f.col("brand").isNull() condition, using something like [input_df.filter(~fail_test), input_df.filter(fail_test)]. Then you join each part with the relevant table (df_sale_brand_null with df_miss, df_sale_brand with df_prod) on the desired columns, handle any misaligned columns, and finally unionByName the two joined tables.

2. union the dataframes df_miss and df_prod, filling in the columns missing from df_miss. Then you join df_sale with the unioned table (aliased a and b respectively) on a conditional statement, such as f.when(f.col("brand").isNotNull(), (f.col("a.Year") == f.col("b.Year")) & (f.col("a.ID") == f.col("b.ID"))).otherwise(f.col("a.Name") == f.col("b.Name")). The output of f.when(...).otherwise(...) is a column, and therefore your join statement will recognise it as a valid input to the on= argument.