I want to join two PySpark DataFrames using a third one. The third DataFrame indicates which of the first two each row's data should be taken from.
DataFrames (first, second, control):

first:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 100 | 200 |
2034 | 12 | 233 | 1633 | 2546400 |
3211 | 11 | 311 | 456 | 544 |
3113 | 13 | 441 | 333 | 645 |
second:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 133 | 444 |
2034 | 12 | 233 | 333 | 34211 |
3211 | 11 | 311 | 7685 | 867443 |
3113 | 13 | 441 | 6544 | 63457 |
control:
ID2 | SOURCE |
---|---|
11 | first |
12 | second |
13 | first |
After the join, the data should look like this:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 133 | 444 |
2034 | 12 | 233 | 333 | 34211 |
3211 | 11 | 311 | 456 | 544 |
3113 | 13 | 441 | 333 | 645 |
How can I do this? I have the following skeleton, but I don't know how to apply the third DataFrame, control:
cols_list = [
    # columns (with aliases) to select
]

first = first.alias("a").join(
    second.alias("b"),
    (first["ID1"] == second["ID1"])
    & (first["ID2"] == second["ID2"])
    & (first["ID3"] == second["ID3"]),
    "left",
).select(cols_list)
If you are comfortable with pure Spark SQL instead of the PySpark Dataframe API, then here is one solution.
First, create the DataFrames (optional, since you already have the data):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

core_schema = StructType([
    StructField("ID1", IntegerType()),
    StructField("ID2", IntegerType()),
    StructField("ID3", IntegerType()),
    StructField("Data1", IntegerType()),
    StructField("Data2", IntegerType()),
])
first_data = [
    (2034, 12, 444, 100, 200),
    (2034, 12, 233, 1633, 2546400),
    (3211, 11, 311, 456, 544),
    (3113, 13, 441, 333, 645),
]
second_data = [
    (2034, 12, 444, 133, 444),
    (2034, 12, 233, 333, 34211),
    (3211, 11, 311, 7685, 867443),
    (3113, 13, 441, 6544, 63457),
]

first_df = spark.createDataFrame(first_data, core_schema)
second_df = spark.createDataFrame(second_data, core_schema)

control_schema = StructType([
    StructField("ID2", IntegerType()),
    StructField("SOURCE", StringType()),
])
control_data = [
    (11, "first"),
    (12, "second"),
    (13, "first"),
]
control_df = spark.createDataFrame(control_data, control_schema)
control_df.show()  # or display(control_df) on Databricks
Next, create a temporary view for each DataFrame; we need views in order to run SQL queries against them. The code below creates a view called "first_tbl" from the first_df DataFrame, and likewise for the other two:
# create views
first_df.createOrReplaceTempView("first_tbl")
second_df.createOrReplaceTempView("second_tbl")
control_df.createOrReplaceTempView("control_tbl")
Next we write our SQL query. The query will:
- inner join first_tbl and second_tbl on ID1, ID2, and ID3, so every output row carries the data from both sources;
- inner join control_tbl on ID2 to attach the SOURCE flag;
- use CASE expressions to take Data1 and Data2 from first when SOURCE is "first", and from second otherwise.
Finally, we execute the query and create a new DataFrame called "final_df":
query = """
select
f.ID1,
f.ID2,
f.ID3,
c.SOURCE,
-- case expression to pick data based on source
CASE WHEN c.SOURCE = 'first' THEN f.Data1 ELSE s.Data1 END AS Data1,
CASE WHEN c.SOURCE = 'first' THEN f.Data2 ELSE s.Data2 END AS Data2
from first_tbl f
inner join second_tbl s
on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
on f.ID2 = c.ID2
"""
final_df = spark.sql(query)
final_df.show()
Output:
+----+---+---+------+-----+-----+
| ID1|ID2|ID3|SOURCE|Data1|Data2|
+----+---+---+------+-----+-----+
|3211| 11|311| first| 456| 544|
|2034| 12|444|second| 133| 444|
|2034| 12|233|second| 333|34211|
|3113| 13|441| first| 333| 645|
+----+---+---+------+-----+-----+