
pyspark join with conditions for empty string

I have three dataframes as below.

df_prod

Year  ID      Name   brand  Point 
2020  20903   Ken    KKK    2000
2019  12890   Matt   MMM    209
2017  346780  Nene   NNN    2000
2020  346780  Nene   NNN    6000

df_miss

Name    brand   point
Holy    HHH     345
Joshi   JJJ     900

df_sale

ID      Name  Year    brand   
126789  Holy  2010            
346780  Nene  2017    NNN     
346780  Nene  2020    NNN     

I need to join df_sale conditionally, as follows. If "brand" is NOT NULL, I need an INNER join of df_sale with df_prod on Year and Name. If "brand" is NULL, I need to join df_sale with df_miss on Name.

Is it possible to have when condition during joins in pyspark? I could see some examples on scala but I am looking for pyspark implementation.

Pseudo code logic

if brand is not null:
    df_sale.join(df_prod, on=['Year', 'ID'], how='inner')
    # also matching df_sale['Name'] == df_prod['Name'], to pull df_prod['point']

elif brand is null:
    df_sale.join(df_miss, on=['Name'], how='inner')
    # to pull df_miss['point']

Expected output:

ID      Name  Year    brand   point
126789  Holy  2010            345
346780  Nene  2017    NNN     2000
346780  Nene  2020    NNN     2000

Is it possible to do this in pyspark or SQL? Please give some pointers. Thanks.

When you think about IF ... ELSE ... conditions on DataFrames (or, for that matter, SQL tables), be aware that the condition must apply to the table as if you were traversing it row by row.

This leaves you with two options (note that f denotes pyspark.sql.functions):

  1. You split your df_sale table into two - df_sale_brand_null and df_sale_brand - based on the f.col("brand").isNull() condition, using something like [input_df.filter(~fail_test), input_df.filter(fail_test)] . Then you join each half with the relevant table ( df_sale_brand_null with df_miss ) on the desired columns, you handle misaligned columns, and finally you unionByName the two joined tables.
  2. You union the dataframe df_miss with df_prod , handling the columns missing from df_miss . Then you join df_sale with the unioned table (aliased a and b respectively) on a conditional statement, such as f.when(f.col("brand").isNotNull(), (f.col("a.Year") == f.col("b.Year")) & (f.col("a.ID") == f.col("b.ID"))).otherwise(f.col("a.Name") == f.col("b.Name")) . The output of f.when(...).otherwise(...) is a column, and therefore your join statement will recognise it as a valid input to the on= argument.
