How to merge two dataframes in pyspark with different columns inside struct or array?
Let's say there are two dataframes: a reference dataframe and a target dataframe.
The reference DF carries the reference schema.
Schema for the reference DF (r_df):
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, the target dataframe's schema is dynamic in nature.
Schema for the target DF (t_df):
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema.
I was able to add/remove columns of non-struct datatypes. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.
Any solution/opinion/workaround would be really helpful.
Expected output: I want to make my t_df's schema exactly the same as my r_df's schema.
The code below is untested but should show how to do it (written from memory, without testing). There may be a way to get the fields out of a struct, but I'm not aware how, so I'm interested to hear others' ideas.
from pyspark.sql.functions import col, lit

structs_in_r_df = [ field.name for field in r_df.schema.fields if str(field.dataType).startswith("Struct") ] # list comprehension collecting r_df's struct columns
struct_columns = []
for struct_col in structs_in_r_df: # get a list of the fields inside each struct
    struct_columns.append(r_df.select(f"{struct_col}.*").columns)

missing_columns = list(set(r_df.columns) - set(t_df.columns)) # columns r_df has but t_df lacks
similar_columns = list(set(r_df.columns).intersection(set(t_df.columns))) # columns present in both
# remove struct columns from both lists so you don't represent them twice.
# you'd need to repeat the above missing/intersection logic for the structs and then rebuild
# them, but the above shows the idea of how to get the fields out.
# you can use an f-string, col(f"{struct_col}.{field}"), to get the values out of the fields.
result = r_df.unionByName( # unionByName matches columns by name, so the order of the select list below doesn't matter
    t_df.select(*(
        [ lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missing_columns ] +
        [ col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similar_columns ]
    )) # building the select list dynamically and passing it as varargs pulls out exactly the columns you need
)
Once you have the union, here's a way to rebuild the struct:
from pyspark.sql.functions import struct

result = result\
    .select(
        col("_id"),
        struct( col("sms").alias("sms") ).alias("notificationsSend"),
        struct(
            *[ col(column).alias(column) for column in struct_columns[0] ] # pass varargs to struct(); struct_columns is a list of per-struct field lists, so index the one you're rebuilding
        ).alias("recordingDetails") # reconstitute the struct from its fields
    )