
PySpark Dataframes: Full Outer Join with a condition

I have the following two dataframes:

dataframe_a
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|     wanadoo.fr|
|        samantha|     randomn.fr|
|             bob|    eidsiva.net|
|           dylan|    vodafone.it|
+----------------+---------------+

dataframe_b
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|  oldwebsite.fr|
|        samantha|     randomn.fr|
|           dylan|      oldweb.it|
|            ryan|      chicks.it|
+----------------+---------------+

I want to do a full outer join but retain the value from the domain column of dataframe_a in cases where I get two different domains for a single user_id. So, my desired dataframe would look like:

desired_df
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|     wanadoo.fr|
|        samantha|     randomn.fr|
|             bob|    eidsiva.net|
|           dylan|    vodafone.it|
|            ryan|      chicks.it|
+----------------+---------------+

I think I can do something like:

desired_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer").drop(dataframe_b.domain)

But I'm not sure whether this will give me ryan in my desired dataframe. Is this the right way?

You will want to use 'coalesce'. In your current solution, ryan will be in the resulting dataframe, but with a null value for the remaining dataframe_a.domain column.

joined_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer")
+----------------+---------------+---------------+
|         user_id|         domain|         domain|
+----------------+---------------+---------------+
|            josh|     wanadoo.fr|  oldwebsite.fr|
|        samantha|     randomn.fr|     randomn.fr|
|             bob|    eidsiva.net|           null|
|           dylan|    vodafone.it|      oldweb.it|
|            ryan|           null|      chicks.it|
+----------------+---------------+---------------+
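
If you want to reproduce this locally, here is a minimal, self-contained sketch, assuming a local SparkSession; the example data simply mirrors the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

dataframe_a = spark.createDataFrame(
    [("josh", "wanadoo.fr"), ("samantha", "randomn.fr"),
     ("bob", "eidsiva.net"), ("dylan", "vodafone.it")],
    ["user_id", "domain"],
)
dataframe_b = spark.createDataFrame(
    [("josh", "oldwebsite.fr"), ("samantha", "randomn.fr"),
     ("dylan", "oldweb.it"), ("ryan", "chicks.it")],
    ["user_id", "domain"],
)

# Joining on the key as a list collapses user_id into one column, but both
# domain columns survive, distinguishable only via their parent dataframes.
joined_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer")
joined_df.show()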

'coalesce' allows you to specify a preference order, skipping null values.

import pyspark.sql.functions as F

# Prefer dataframe_a.domain; fall back to dataframe_b.domain where it is null.
joined_df = joined_df.withColumn(
  "preferred_domain",
  F.coalesce(dataframe_a.domain, dataframe_b.domain)
)
# Drop both ambiguous source columns, keeping only the coalesced result.
joined_df = joined_df.drop(dataframe_a.domain).drop(dataframe_b.domain)

Giving

+----------------+----------------+
|         user_id|preferred_domain|
+----------------+----------------+
|            josh|      wanadoo.fr|
|        samantha|      randomn.fr|
|             bob|     eidsiva.net|
|           dylan|     vodafone.it|
|            ryan|       chicks.it|
+----------------+----------------+
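
The join, coalesce, and column cleanup can also be collapsed into a single select. This is just a condensed sketch of the same approach, reusing dataframe_a and dataframe_b from the sketch above:

from pyspark.sql import functions as F

# Join, prefer dataframe_a.domain over dataframe_b.domain, and keep only the
# coalesced column, in one step.
desired_df = (
    dataframe_a.join(dataframe_b, ["user_id"], how="full_outer")
    .select(
        "user_id",
        F.coalesce(dataframe_a.domain, dataframe_b.domain).alias("domain"),
    )
)
desired_df.show()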

No, doing a full_outer join will leave the domain for ryan as a null value in your desired dataframe. No type of join operation on the given dataframes will, by itself, give you the desired output.
