[英]How to merge two dataframes in pyspark with different columns inside struct or array?
可以說,有兩個數據框。 參考數據框和目標數據框。
參考 DF 是參考模式。
參考 DF (r_df) 的架構
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
但是,目標數據框模式本質上是動態的。
目標 DF (t_df) 的架構
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
所以我們觀察到目標模式的多個變化。
我能夠添加/刪除非結構數據類型的列。 然而,結構和數組對我來說真的很痛苦。 由於有 50 多列,我需要一個適用於所有人的優化解決方案。
任何解決方案/意見/方式都會非常有幫助。
預期的輸出我想讓我的t_df的架構與我的r_df的架構完全相同。
下面的代碼未經測試,但應該規定如何去做。 (從記憶中編寫,未經測試。)可能有一種方法可以從結構中獲取字段,但我不知道如何所以我有興趣聽到其他人的想法。
stucts_in_r_df = [ field.name for field in r_df.schema.fields if(str(field.dataType).startswith("Struct")) ] # use list comprehension to create a list of struct fields
struct_columns = []
for structs in stucts_in_r_df: # get a list of fields in the structs
struct_columns.append(r_df\
.select(
"$structs.*"
).columns
)
missingColumns = list(set(r_df.columns) - set(tdf.columns)) # find missing columns
similiar_Columns = list(set(r_df.columns).intersect(set(tdf.columns))))
#remove struct columns from both lists so you don't represent them twice.
# you need to repeat the above intersection/missing for the structs and then rebuild them but really the above gives you the idea of how to get the fields out.
# you can use variable replacemens col("$struct.$field") to get the values out of the fields,
result = r_df.union(
tdf\
.select(*(
[ lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missingColumns] +\
[ col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similiar_Columns] ) # using list comprehension with joins and then passing as varargs to select will completely dynamically pull out the values you need.
)
)
一旦你有聯合來拉回結構,這是一種方法:
result = result\
.select(
col("_id"),
struct( col("sms").alias("sms") ).alias("notificationsSend"),
struct( *[col(column).alias(column) for column in struct_columns] # pass varags to struct with columns
).alias("recordingDetails") #reconstitue struct with
)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.