How do I merge two DataFrames in PySpark when the columns inside structs or arrays differ?

Question

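Say there are two DataFrames: a reference DataFrame and a target DataFrame.

The reference DF defines the reference schema.

Schema of the reference DF (r_df)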

r_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- mail: boolean (nullable = true)
 |    |-- sms: boolean (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)

The schema of the target DataFrame, however, is dynamic in nature.

Schema of the target DF (t_df)

t_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- sms: string (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)
 |    |    |-- createdBy: string (nullable = true)

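So we can observe several changes in the target schema.

A struct or array in t_df can have more or fewer fields than the same struct or array in r_df.
The data type of a column can also change, so type casting is needed. (For example, the sms column is boolean in r_df but string in t_df.)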
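
To make these differences concrete, here is a minimal sketch that compares two schemas and lists the nested fields that are missing, extra, or typed differently. It assumes the schemas are plain dictionaries in the shape returned by `df.schema.jsonValue()`. The helpers `field`, `struct_t`, `array_t`, `leaf_types` and `diff_schemas` are hypothetical names made up for this illustration, and only struct and array nesting is handled.

```python
def field(name, type_):
    # One struct field, in the shape used by schema.jsonValue().
    return {"name": name, "type": type_, "nullable": True, "metadata": {}}

def struct_t(*fields):
    return {"type": "struct", "fields": list(fields)}

def array_t(element):
    return {"type": "array", "elementType": element, "containsNull": True}

recording = ["channelName", "fileLink", "id", "recorderId", "resourceId"]

# Schemas from the question, written out by hand for the example.
ref_schema = struct_t(
    field("_id", "string"),
    field("notificationsSend", struct_t(field("mail", "boolean"), field("sms", "boolean"))),
    field("recordingDetails", array_t(struct_t(*[field(n, "string") for n in recording]))),
)
tgt_schema = struct_t(
    field("_id", "string"),
    field("notificationsSend", struct_t(field("sms", "string"))),
    field("recordingDetails",
          array_t(struct_t(*[field(n, "string") for n in recording + ["createdBy"]]))),
)

def leaf_types(schema_type, prefix=""):
    """Map every leaf path to its type name; '[]' marks array elements."""
    if isinstance(schema_type, dict) and schema_type["type"] == "struct":
        leaves = {}
        for f in schema_type["fields"]:
            path = f"{prefix}.{f['name']}" if prefix else f["name"]
            leaves.update(leaf_types(f["type"], path))
        return leaves
    if isinstance(schema_type, dict) and schema_type["type"] == "array":
        return leaf_types(schema_type["elementType"], prefix + "[]")
    # Atomic types are plain strings; other complex types are reported by name only.
    name = schema_type if isinstance(schema_type, str) else schema_type["type"]
    return {prefix: name}

def diff_schemas(ref, tgt):
    """Report leaf paths that differ between the reference and target schemas."""
    ref_leaves, tgt_leaves = leaf_types(ref), leaf_types(tgt)
    common = sorted(ref_leaves.keys() & tgt_leaves.keys())
    return {
        "missing_in_target": sorted(ref_leaves.keys() - tgt_leaves.keys()),
        "extra_in_target": sorted(tgt_leaves.keys() - ref_leaves.keys()),
        "type_changed": {p: (ref_leaves[p], tgt_leaves[p])
                         for p in common if ref_leaves[p] != tgt_leaves[p]},
    }

report = diff_schemas(ref_schema, tgt_schema)
print(report)
```

With real DataFrames the two dictionaries would presumably come from `r_df.schema.jsonValue()` and `t_df.schema.jsonValue()` instead of being written by hand.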

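I am able to add and remove columns of non-struct data types. Structs and arrays, however, are a real pain for me. Since there are more than 50 columns, I need an optimized solution that works for all of them.

Any solution, opinion, or approach would be very helpful.

Expected output: I want the schema of my t_df to be exactly the same as the schema of my r_df.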

Answer 1

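The code below is untested, but it should show how to go about this. (It is written from memory.) There may be a better way to get the fields out of a struct, but I don't know it, so I'm interested in hearing other people's ideas.

Extract the struct column names and types.
Find the columns that need to be removed.
Remove the columns.
Rebuild the structs according to r_df.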

from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType

# Use a list comprehension to collect the names of the top-level struct columns in r_df.
structs_in_r_df = [field.name for field in r_df.schema.fields if isinstance(field.dataType, StructType)]

# Get the list of field names inside each struct (same order as structs_in_r_df).
struct_columns = []
for struct_name in structs_in_r_df:
    struct_columns.append(r_df.select(f"{struct_name}.*").columns)

# Struct columns are left out of both lists so they are not represented twice.
flat_columns = [c for c in r_df.columns if c not in structs_in_r_df]
missing_columns = [c for c in flat_columns if c not in t_df.columns]  # in r_df only
similar_columns = [c for c in flat_columns if c in t_df.columns]      # in both

# You need to repeat the missing/similar comparison for the fields of each struct and then
# rebuild the structs; col(f"{struct_name}.{field_name}") reads a single field out of a struct.
# Note: cast() on a nested column only works when its nested fields already line up, so an
# array of structs such as recordingDetails needs the same per-field treatment.
r_types = dict(r_df.dtypes)
result = r_df.select(*flat_columns).union(
    t_df.select(*[
        # Building the list dynamically and passing it as varargs pulls out exactly the
        # columns needed, in r_df's order, because union() matches columns by position.
        (col(c) if c in similar_columns else lit(None)).cast(r_types[c]).alias(c)
        for c in flat_columns
    ])
)

Once you have the union, this is one way to pull the structs back together:

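from pyspark.sql.functions import col, struct

# Assumes the struct fields are available in `result` as flat top-level columns.
# Field names of the recordingDetails elements, taken from r_df.
recording_fields = r_df.schema["recordingDetails"].dataType.elementType.fieldNames()
result = result.select(
    col("_id"),
    struct(col("sms").alias("sms")).alias("notificationsSend"),
    # Pass the columns to struct() as varargs to reconstitute the struct.
    struct(*[col(c).alias(c) for c in recording_fields]).alias("recordingDetails"),
)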


Solution 1
0 2022-06-24 14:23:10
