
How to join two PySpark DataFrames based on a third one

I want to join two PySpark DataFrames using a third one. The third DataFrame specifies which of the first two the data should be taken from.

DataFrames (first, second, control):

first:

ID1   ID2  ID3  Data1  Data2
2034  12   444  100    200
2034  12   233  1633   2546400
3211  11   311  456    544
3113  13   441  333    645

second:

ID1   ID2  ID3  Data1  Data2
2034  12   444  133    444
2034  12   233  333    34211
3211  11   311  7685   867443
3113  13   441  6544   63457

control:

ID2  SOURCE
11   first
12   second
13   first

After the join, data would look like this:

ID1   ID2  ID3  Data1  Data2
2034  12   444  133    444
2034  12   233  333    34211
3211  11   311  456    544
3113  13   441  333    645

How can I do this? I have the following outline, but I don't know how to apply the third DataFrame, control.

cols_list = [
    # columns (with aliases) to select
]

# Each comparison needs its own parentheses, because & binds tighter than ==
first = first.alias("a").join(
    second.alias("b"),
    (first['ID1'] == second['ID1'])
    & (first['ID2'] == second['ID2'])
    & (first['ID3'] == second['ID3']),
    'left',
).select(cols_list)

If you are comfortable with pure Spark SQL instead of the PySpark DataFrame API, here is one solution.

First, create the DataFrames (optional, since you already have the data):

from pyspark.sql.types import StructType,StructField, IntegerType, StringType

core_schema = StructType([
  StructField("ID1",IntegerType()),
  StructField("ID2",IntegerType()),
  StructField("ID3",IntegerType()),
  StructField("Data1",IntegerType()),
  StructField("Data2",IntegerType()),
])

first_data = [
  (2034,    12,     444,    100,    200),
  (2034,    12,     233,    1633,   2546400),
  (3211,    11,     311,    456,    544),
  (3113,    13,     441,    333,    645),
]
second_data = [
  (2034,    12,     444,    133,    444),
  (2034,    12,     233,    333,    34211),
  (3211,    11,     311,    7685,   867443),
  (3113,    13,     441,    6544,   63457),
]

first_df = spark.createDataFrame(first_data,core_schema)
second_df = spark.createDataFrame(second_data,core_schema)

control_schema = StructType([
  StructField("ID2",IntegerType()),
  StructField("SOURCE",StringType())
])
control_data = [
  (11,  "first"),
  (12,  "second"),
  (13,  "first"),
]
control_df = spark.createDataFrame(control_data, control_schema)
control_df.show()  # display(control_df) is a notebook helper (e.g. Databricks); show() works in any PySpark session

Next, create a temporary view for each DataFrame. Views are needed so that the DataFrames can be referenced by name in SQL queries. The code below creates a view called "first_tbl" from the "first_df" DataFrame, and likewise for the other two.

# create views
first_df.createOrReplaceTempView("first_tbl")
second_df.createOrReplaceTempView("second_tbl")
control_df.createOrReplaceTempView("control_tbl")

Next, we write the SQL query. The query will:

  • inner join first_tbl and second_tbl on all three ID columns
  • inner join the result with control_tbl on ID2
  • use a CASE expression to pick the data from the correct table: if control_tbl.SOURCE is "first", it takes the data from first_tbl; otherwise it takes it from second_tbl

select 
  f.ID1, 
  f.ID2, 
  f.ID3,
  c.SOURCE,
  -- CASE expressions select the data columns based on SOURCE
  CASE WHEN c.SOURCE = 'first' THEN f.Data1 ELSE s.Data1 END AS Data1,
  CASE WHEN c.SOURCE = 'first' THEN f.Data2 ELSE s.Data2 END AS Data2
  
from first_tbl f
inner join second_tbl s
  on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
  on f.ID2 = c.ID2
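
To make the logic of the query easier to follow, here is a minimal plain-Python sketch of the same selection rule, with no Spark involved. The helper name pick_rows and the tuple/dict representation are assumptions made for illustration only; they are not part of the original answer.

```python
# Plain-Python sketch of the selection logic (no Spark required).
# Rows are (ID1, ID2, ID3, Data1, Data2), as in the example tables.
first_rows = [
    (2034, 12, 444, 100, 200),
    (2034, 12, 233, 1633, 2546400),
    (3211, 11, 311, 456, 544),
    (3113, 13, 441, 333, 645),
]
second_rows = [
    (2034, 12, 444, 133, 444),
    (2034, 12, 233, 333, 34211),
    (3211, 11, 311, 7685, 867443),
    (3113, 13, 441, 6544, 63457),
]
control = {11: "first", 12: "second", 13: "first"}  # ID2 -> SOURCE


def pick_rows(first_rows, second_rows, control):
    """Hypothetical helper mirroring the two inner joins and the CASE expressions."""
    # Index the second table by its (ID1, ID2, ID3) key.
    second_by_key = {row[:3]: row[3:] for row in second_rows}
    result = []
    for row in first_rows:
        key, first_data = row[:3], row[3:]
        id2 = key[1]
        # Inner-join semantics: skip rows with no match in second or control.
        if key not in second_by_key or id2 not in control:
            continue
        # Same rule as the CASE expression: "first" wins, anything else falls back to second.
        data = first_data if control[id2] == "first" else second_by_key[key]
        result.append(key + data)
    return result


for row in pick_rows(first_rows, second_rows, control):
    print(row)
```

As in the SQL, any SOURCE value other than "first" falls through to the second table, so a typo in the control data would silently select from second.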

Finally, we execute the query and create a new DataFrame called "final_df":

query = """
select 
  f.ID1, 
  f.ID2,
  f.ID3,
  c.SOURCE,
  -- CASE expressions select the data columns based on SOURCE
  CASE WHEN c.SOURCE = 'first' THEN f.Data1 ELSE s.Data1 END AS Data1,
  CASE WHEN c.SOURCE = 'first' THEN f.Data2 ELSE s.Data2 END AS Data2
  
from first_tbl f
inner join second_tbl s
  on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
  on f.ID2 = c.ID2
"""

final_df = spark.sql(query)
final_df.show()

Output:

+----+---+---+------+-----+-----+
| ID1|ID2|ID3|SOURCE|Data1|Data2|
+----+---+---+------+-----+-----+
|3211| 11|311| first|  456|  544|
|2034| 12|444|second|  133|  444|
|2034| 12|233|second|  333|34211|
|3113| 13|441| first|  333|  645|
+----+---+---+------+-----+-----+
