How to join two pyspark dataframe based on the third one
I want to join two PySpark dataframes using a third one. The third one contains the information about which of the first two DFs the data should be taken from.
The dataframes (first, second, control):
first:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 100 | 200 |
2034 | 12 | 233 | 1633 | 2546400 |
3211 | 11 | 311 | 456 | 544 |
3113 | 13 | 441 | 333 | 645 |
second:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 133 | 444 |
2034 | 12 | 233 | 333 | 34211 |
3211 | 11 | 311 | 7685 | 867443 |
3113 | 13 | 441 | 6544 | 63457 |
control:
ID2 | SOURCE |
---|---|
11 | first |
12 | second |
13 | first |
After the join, the data should look like this:
ID1 | ID2 | ID3 | Data1 | Data2 |
---|---|---|---|---|
2034 | 12 | 444 | 133 | 444 |
2034 | 12 | 233 | 333 | 34211 |
3211 | 11 | 311 | 456 | 544 |
3113 | 13 | 441 | 333 | 645 |
How can I do this? I have a scheme like the one below, but I don't know how to apply the third dataframe, control.
cols_list = [
# cols with aliases to choose
]
first = first.alias("a").join(
    second.alias("b"),
    (first['ID1'] == second['ID1']) & (first['ID2'] == second['ID2']) & (first['ID3'] == second['ID3']),
    'left'
).select(cols_list)
If you are comfortable with pure Spark SQL rather than the PySpark DataFrame API, here is one solution.
First, create the dataframes (optional, since you already have the data):
from pyspark.sql.types import StructType,StructField, IntegerType, StringType
core_schema = StructType([
StructField("ID1",IntegerType()),
StructField("ID2",IntegerType()),
StructField("ID3",IntegerType()),
StructField("Data1",IntegerType()),
StructField("Data2",IntegerType()),
])
first_data = [
(2034, 12, 444, 100, 200),
(2034, 12, 233, 1633, 2546400),
(3211, 11, 311, 456, 544),
(3113, 13, 441, 333, 645),
]
second_data = [
(2034, 12, 444, 133, 444),
(2034, 12, 233, 333, 34211),
(3211, 11, 311, 7685, 867443),
(3113, 13, 441, 6544, 63457),
]
first_df = spark.createDataFrame(first_data,core_schema)
second_df = spark.createDataFrame(second_data,core_schema)
control_schema = StructType([
StructField("ID2",IntegerType()),
StructField("SOURCE",StringType())
])
control_data = [
(11, "first"),
(12, "second"),
(13, "first"),
]
control_df = spark.createDataFrame(control_data, control_schema)
display(control_df)
Next, create a temporary view for each dataframe. We need the views in order to run SQL queries. The code below creates a view named "first_tbl" from the "first_df" dataframe, and so on:
# create views
first_df.createOrReplaceTempView("first_tbl")
second_df.createOrReplaceTempView("second_tbl")
control_df.createOrReplaceTempView("control_tbl")
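Optionally, you can run a quick query against one of the views to confirm they were registered, for example:
spark.sql("select * from control_tbl").show()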
Next, we write the SQL query. The query will join first_tbl and second_tbl on ID1, ID2 and ID3, join the result to control_tbl on ID2, and use a CASE expression to pick Data1 and Data2 from either table depending on SOURCE:
select
f.ID1,
f.ID2,
f.ID3,
c.SOURCE,
-- switch case statement to select data based on source
CASE WHEN c.SOURCE == "first" then f.Data1 else s.Data1 END as Data1,
CASE WHEN c.SOURCE == "first" then f.Data2 else s.Data2 END as Data2
from first_tbl f
inner join second_tbl s
on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
on f.ID2 = c.ID2
Finally, we execute the query and create a new dataframe named "final_df":
query = """
select
f.ID1,
f.ID2,
f.ID3,
c.SOURCE,
-- switch case statement to select data based on source
CASE WHEN c.SOURCE == "first" then f.Data1 else s.Data1 END as Data1,
CASE WHEN c.SOURCE == "first" then f.Data2 else s.Data2 END as Data2
from first_tbl f
inner join second_tbl s
on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
on f.ID2 = c.ID2
"""
final_df = spark.sql(query)
final_df.show()
Output:
+----+---+---+------+-----+-----+
| ID1|ID2|ID3|SOURCE|Data1|Data2|
+----+---+---+------+-----+-----+
|3211| 11|311| first| 456| 544|
|2034| 12|444|second| 133| 444|
|2034| 12|233|second| 333|34211|
|3113| 13|441| first| 333| 645|
+----+---+---+------+-----+-----+
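If you would rather stay in the DataFrame API, a rough equivalent of the SQL above could look like the sketch below. It reuses the first_df, second_df and control_df created earlier, and when/otherwise plays the role of the CASE expression:
from pyspark.sql import functions as F

# join first and second on the ID columns, then bring in the control table on ID2
joined = (
    first_df.alias("f")
    .join(second_df.alias("s"), ["ID1", "ID2", "ID3"], "inner")
    .join(control_df.alias("c"), "ID2", "inner")
)

# take Data1/Data2 from the first or the second table depending on SOURCE
final_df = joined.select(
    "ID1", "ID2", "ID3", "SOURCE",
    F.when(F.col("SOURCE") == "first", F.col("f.Data1")).otherwise(F.col("s.Data1")).alias("Data1"),
    F.when(F.col("SOURCE") == "first", F.col("f.Data2")).otherwise(F.col("s.Data2")).alias("Data2"),
)
final_df.show()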