I want to join two PySpark DataFrames using a third one. The third DataFrame indicates which of the first two each row's data should be taken from.
DataFrames (first, second, control):

first:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 100 | 200 |
2034 | 12 | 233 | 1633 | 2546400 |
3211 | 11 | 311 | 456 | 544 |
3113 | 13 | 441 | 333 | 645 |
second:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 133 | 444 |
2034 | 12 | 233 | 333 | 34211 |
3211 | 11 | 311 | 7685 | 867443 |
3113 | 13 | 441 | 6544 | 63457 |
control:
ID2 | SOURCE |
---|---|
11 | first |
12 | second |
13 | first |
After the join, the data should look like this:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 133 | 444 |
2034 | 12 | 233 | 333 | 34211 |
3211 | 11 | 311 | 456 | 544 |
3113 | 13 | 441 | 333 | 645 |
How can I do this? I have the following skeleton, but I don't know how to apply the third DataFrame, control:
cols_list = [
    # columns (with aliases) to select
]

first = first.alias("a").join(
    second.alias("b"),
    (first["ID1"] == second["ID1"])
    & (first["ID2"] == second["ID2"])
    & (first["ID3"] == second["ID3"]),
    "left",
).select(cols_list)
If you are comfortable with pure Spark SQL instead of the PySpark Dataframe API, then here is one solution.
First, create the DataFrames (optional, since you already have the data):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

core_schema = StructType([
    StructField("ID1", IntegerType()),
    StructField("ID2", IntegerType()),
    StructField("ID3", IntegerType()),
    StructField("Data1", IntegerType()),
    StructField("Data2", IntegerType()),
])
first_data = [
    (2034, 12, 444, 100, 200),
    (2034, 12, 233, 1633, 2546400),
    (3211, 11, 311, 456, 544),
    (3113, 13, 441, 333, 645),
]
second_data = [
    (2034, 12, 444, 133, 444),
    (2034, 12, 233, 333, 34211),
    (3211, 11, 311, 7685, 867443),
    (3113, 13, 441, 6544, 63457),
]

first_df = spark.createDataFrame(first_data, core_schema)
second_df = spark.createDataFrame(second_data, core_schema)

control_schema = StructType([
    StructField("ID2", IntegerType()),
    StructField("SOURCE", StringType()),
])
control_data = [
    (11, "first"),
    (12, "second"),
    (13, "first"),
]
control_df = spark.createDataFrame(control_data, control_schema)
control_df.show()  # or display(control_df) on Databricks
Next, create a temporary view for each DataFrame; we need views in order to run SQL queries against them. The code below creates a view called "first_tbl" from the first_df DataFrame, and likewise for the other two:
# create views
first_df.createOrReplaceTempView("first_tbl")
second_df.createOrReplaceTempView("second_tbl")
control_df.createOrReplaceTempView("control_tbl")
Next we write our SQL query. The query will:
- inner join first_tbl and second_tbl on ID1, ID2, and ID3, so every output row carries the data from both sources;
- inner join control_tbl on ID2 to attach the SOURCE flag;
- use CASE expressions to take Data1 and Data2 from first when SOURCE is "first", and from second otherwise.
Finally, we execute the query and create a new DataFrame called "final_df":
query = """
select
f.ID1,
f.ID2,
f.ID3,
c.SOURCE,
-- case expression to pick data based on source
CASE WHEN c.SOURCE = 'first' THEN f.Data1 ELSE s.Data1 END AS Data1,
CASE WHEN c.SOURCE = 'first' THEN f.Data2 ELSE s.Data2 END AS Data2
from first_tbl f
inner join second_tbl s
on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
on f.ID2 = c.ID2
"""
final_df = spark.sql(query)
final_df.show()
Output:
+----+---+---+------+-----+-----+
| ID1|ID2|ID3|SOURCE|Data1|Data2|
+----+---+---+------+-----+-----+
|3211| 11|311| first| 456| 544|
|2034| 12|444|second| 133| 444|
|2034| 12|233|second| 333|34211|
|3113| 13|441| first| 333| 645|
+----+---+---+------+-----+-----+