
How to modify a column for a join in a Spark DataFrame when the join keys are given as a list?

I am trying to join two DataFrames on a list of join keys, and I want to add the ability to join on a subset of the keys when one of the key values is null.

The two DataFrames, df_1 and df_2, are created as follows:

data1 = [[1, '2018-07-31', 215, 'a'],
         [2, '2018-07-30', None, 'b'],
         [3, '2017-10-28', 201, 'c']]
df_1 = sqlCtx.createDataFrame(
    data1, ['application_number', 'application_dt', 'account_id', 'var1'])

and

data2 = [[1, '2018-07-31', 215, 'aaa'],
         [2, '2018-07-30', None, 'bbb'],
         [3, '2017-10-28', 201, 'ccc']]
df_2 = sqlCtx.createDataFrame(
    data2, ['application_number', 'application_dt', 'account_id', 'var2'])

The code I use to join is this:

key_a = ['application_number','application_dt','account_id']
new = df_1.join(df_2,key_a,'left')

The output is:

+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 3|    2017-10-28|       201|   c| ccc|
|                 2|    2018-07-30|      null|   b|null|
+------------------+--------------+----------+----+----+

My concern here is that when account_id is null, the join should still work by comparing the other two keys.

The required output should be like this:

+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 3|    2017-10-28|       201|   c| ccc|
|                 2|    2018-07-30|      null|   b| bbb|
+------------------+--------------+----------+----+----+
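
From what I understand, the null row drops out because NULL = NULL does not evaluate to true in Spark SQL, so an equi-join key that is null never matches. A quick sketch to check this on df_1 (eqNullSafe is the null-safe comparison, Spark 2.3+; I only use it here to illustrate the behaviour, not as a solution, since I still need the key-list form):

from pyspark.sql import functions as F

# Plain equality on a null key yields null (not true), which is why the
# row with the null account_id is lost; eqNullSafe treats two nulls as equal.
df_1.select(
    'account_id',
    (F.col('account_id') == F.col('account_id')).alias('plain_eq'),         # null on the null row
    F.col('account_id').eqNullSafe(F.col('account_id')).alias('null_safe')  # true on every row
).show()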

I found a similar approach that uses the following statement:

  join_elem = "df_1.application_number == 
  df_2.application_number|df_1.application_dt == 
  df_2.application_dt|F.coalesce(df_1.account_id,F.lit(0)) ==  
  F.coalesce(df_2.account_id,F.lit(0))".split("|")
  join_elem_column = [eval(x) for x in join_elem]
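
If I read it right, after the eval this is just the following list of Column conditions (a sketch, using the DataFrames defined above):

from pyspark.sql import functions as F

# What join_elem_column evaluates to: a list of Column conditions, with
# coalesce substituting 0 for a null account_id on both sides.
join_elem_column = [
    df_1.application_number == df_2.application_number,
    df_1.application_dt == df_2.application_dt,
    F.coalesce(df_1.account_id, F.lit(0)) == F.coalesce(df_2.account_id, F.lit(0)),
]
# new = df_1.join(df_2, join_elem_column, 'left')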

But the design considerations do not allow me to use a full join expression, and I am stuck with using the list of column names as the join key.

I have been trying to find a way to fold this coalesce logic into the key list itself, but have not had any success so far.

I would call this solution a workaround.

The issue here is that one of the join keys has a null value in the DataFrame, and the OP wants the remaining key columns to be used instead. Why not assign an arbitrary value to this null and then perform the join? Effectively this is the same as joining on the remaining two keys.

from pyspark.sql.functions import col, when

df1, df2 = df_1, df_2   # the DataFrames from the question

# Replace null with an arbitrary value that has little chance of
# occurring in the dataset, e.g. -100000.
df1 = df1.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))
df2 = df2.withColumn('account_id', when(col('account_id').isNull(), -100000).otherwise(col('account_id')))

# Do a FULL join.
df = df1.join(df2, ['application_number', 'application_dt', 'account_id'], 'full')

# Replace the arbitrary value back with null.
df = df.withColumn('account_id', when(col('account_id') == -100000, None).otherwise(col('account_id')))
df.show()
+------------------+--------------+----------+----+----+
|application_number|application_dt|account_id|var1|var2|
+------------------+--------------+----------+----+----+
|                 1|    2018-07-31|       215|   a| aaa|
|                 2|    2018-07-30|      null|   b| bbb|
|                 3|    2017-10-28|       201|   c| ccc|
+------------------+--------------+----------+----+----+
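
One thing to keep in mind with this workaround: the placeholder must never occur as a real account_id, otherwise a null key could falsely match a real one. Starting from the original df_1 and df_2 (with the nulls still in place), the forward substitution can also be written with fillna; a minimal sketch, assuming -100000 is a safe placeholder as above:

from pyspark.sql.functions import col, when

# Sketch: fillna does the forward substitution on the numeric key; the restore
# step is the same when/otherwise as above. Assumes -100000 never occurs in the data.
df = (df_1.fillna(-100000, subset=['account_id'])
          .join(df_2.fillna(-100000, subset=['account_id']),
                ['application_number', 'application_dt', 'account_id'],
                'full')
          .withColumn('account_id',
                      when(col('account_id') == -100000, None)
                      .otherwise(col('account_id'))))
df.show()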
