How to join two PySpark DataFrames based on a third one

Question

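I want to join two PySpark DataFrames using a third one. The third DataFrame specifies which of the first two DataFrames each row's data should be taken from.

The DataFrames (first, second, control):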

ID1	ID2	ID3	Data1	Data2
2034	12	444	100	200
2034	12	233	1633	2546400
3211	11	311	456	544
3113	13	441	333	645

ID1	ID2	ID3	Data1	Data2
2034	12	444	133	444
2034	12	233	333	34211
3211	11	311	7685	867443
3113	13	441	6544	63457

ID2	SOURCE
11	first
12	second
13	first

After the join, the data should look like this:

ID1	ID2	ID3	Data1	Data2
2034	12	444	133	444
2034	12	233	333	34211
3211	11	311	456	544
3113	13	441	333	645

How can I do this? I have the following draft, but I don't know how to apply the third DataFrame, control.

cols_list = [
    # cols with aliases to choose
]

# Each comparison needs its own parentheses because & binds tighter than ==
first = first.alias("a").join(
    second.alias("b"),
    (first['ID1'] == second['ID1'])
    & (first['ID2'] == second['ID2'])
    & (first['ID3'] == second['ID3']),
    'left'
).select(cols_list)

Answer 1

If you are comfortable using plain Spark SQL instead of the PySpark DataFrame API, here is one solution.

First, create the DataFrames (optional, since you already have the data):

from pyspark.sql.types import StructType,StructField, IntegerType, StringType

core_schema = StructType([
  StructField("ID1",IntegerType()),
  StructField("ID2",IntegerType()),
  StructField("ID3",IntegerType()),
  StructField("Data1",IntegerType()),
  StructField("Data2",IntegerType()),
])

first_data = [
  (2034,    12,     444,    100,    200),
  (2034,    12,     233,    1633,   2546400),
  (3211,    11,     311,    456,    544),
  (3113,    13,     441,    333,    645),
]
second_data = [
  (2034,    12,     444,    133,    444),
  (2034,    12,     233,    333,    34211),
  (3211,    11,     311,    7685,   867443),
  (3113,    13,     441,    6544,   63457),
]

first_df = spark.createDataFrame(first_data,core_schema)
second_df = spark.createDataFrame(second_data,core_schema)

control_schema = StructType([
  StructField("ID2",IntegerType()),
  StructField("SOURCE",StringType())
])
control_data = [
  (11,  "first"),
  (12,  "second"),
  (13,  "first"),
]
control_df = spark.createDataFrame(control_data, control_schema)
display(control_df)

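Next, create a temporary view for each DataFrame; the views are needed to run SQL queries against them. The code below creates a view named "first_tbl" from the "first_df" DataFrame, and likewise for the other two.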

# create views
first_df.createOrReplaceTempView("first_tbl")
second_df.createOrReplaceTempView("second_tbl")
control_df.createOrReplaceTempView("control_tbl")

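Next, write the SQL query. The query will:

Inner join first_tbl and second_tbl on all three ID columns
Inner join the result with control_tbl on ID2
Use a CASE expression to take the data from the correct table. For example, if control_tbl.SOURCE is "first", the data comes from first_tbl; otherwise it comes from second_tbl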


select 
  f.ID1, 
  f.ID2, 
  f.ID3,
  c.SOURCE,
  -- switch case statement to select data based on source
  CASE WHEN c.SOURCE == "first" then f.Data1 else s.Data1 END as Data1,
  CASE WHEN c.SOURCE == "first" then f.Data2 else s.Data2 END as Data2
  
from first_tbl f
inner join second_tbl s
  on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
  on f.ID2 = c.ID2

Finally, run the query and create a new DataFrame named "final_df":

query = """
select 
  f.ID1, 
  f.ID2,
  f.ID3,
  c.SOURCE,
  -- switch case statement to select data based on source
  CASE WHEN c.SOURCE == "first" then f.Data1 else s.Data1 END as Data1,
  CASE WHEN c.SOURCE == "first" then f.Data2 else s.Data2 END as Data2
  
from first_tbl f
inner join second_tbl s
  on f.ID1 = s.ID1 and f.ID2 = s.ID2 and f.ID3 = s.ID3
inner join control_tbl c
  on f.ID2 = c.ID2
"""

final_df = spark.sql(query)
final_df.show()

Output:

+----+---+---+------+-----+-----+
| ID1|ID2|ID3|SOURCE|Data1|Data2|
+----+---+---+------+-----+-----+
|3211| 11|311| first|  456|  544|
|2034| 12|444|second|  133|  444|
|2034| 12|233|second|  333|34211|
|3113| 13|441| first|  333|  645|
+----+---+---+------+-----+-----+
