How to merge two data frames in PySpark with different columns inside a struct or array?

Question

Let's say there are two data frames: a reference data frame and a target data frame.

The reference DF holds the reference schema.

Schema for the reference DF (r_df):

r_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- mail: boolean (nullable = true)
 |    |-- sms: boolean (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)

However, the target data frame's schema is dynamic in nature.

Schema for the target DF (t_df):

t_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- sms: string (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)
 |    |    |-- createdBy: string (nullable = true)

So we observe multiple changes in the target's schema:

Structs or arrays inside t_df can have more or fewer columns.
The datatype of a column can change too, so type casting is required. (E.g. the sms column is boolean in r_df but string in t_df.)

I was able to add/remove columns that are of a non-struct datatype. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.

Any solution, opinion, or workaround would be really helpful.

Expected output: I want to make my t_df's schema exactly the same as my r_df's schema.

Answer 1

The code below is untested (written from memory), but it should outline how to do it. There may be a way to get the fields from a struct directly, but I'm not aware of one, so I'm interested to hear others' ideas.

Extract the struct column names and types.
Find the columns that need to be dropped.
Drop those columns.
Rebuild the structs according to r_df.
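
The first two steps can also be sketched without a Spark session. This is not part of the original answer: it is a minimal sketch that assumes each schema has been converted to its JSON dict form with `df.schema.jsonValue()`, and `nested_fields` and `schema_diff` are hypothetical helper names. It only handles structs, arrays, and primitive types (map types are not covered).

```python
def nested_fields(dtype, prefix=""):
    """Map dotted field paths to primitive type names.

    `dtype` is a schema (or nested type) in the dict form produced by
    StructType.jsonValue(); primitive types appear as plain strings.
    """
    if isinstance(dtype, str):
        return {prefix: dtype}
    if dtype["type"] == "array":
        # Elements of an array are reported under the array's own path.
        return nested_fields(dtype["elementType"], prefix)
    if dtype["type"] == "struct":
        out = {}
        for field in dtype["fields"]:
            path = f"{prefix}.{field['name']}" if prefix else field["name"]
            out.update(nested_fields(field["type"], path))
        return out
    raise ValueError(f"unsupported type in this sketch: {dtype['type']}")


def schema_diff(ref_schema, tgt_schema):
    """Compare two schemas and list the nested paths that differ."""
    ref = nested_fields(ref_schema)
    tgt = nested_fields(tgt_schema)
    return {
        "missing": sorted(ref.keys() - tgt.keys()),  # in r_df only: add as NULL
        "extra": sorted(tgt.keys() - ref.keys()),    # in t_df only: drop
        "retyped": sorted(k for k in ref.keys() & tgt.keys() if ref[k] != tgt[k]),
    }


# Intended usage (needs a running Spark session, so it is left as a comment):
# diff = schema_diff(r_df.schema.jsonValue(), t_df.schema.jsonValue())
```

For the schemas in the question, this should report notificationsSend.mail as missing, recordingDetails.createdBy as extra, and notificationsSend.sms as retyped.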

from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType

# Use a list comprehension to collect the names of the struct columns in r_df.
structs_in_r_df = [field.name for field in r_df.schema.fields if isinstance(field.dataType, StructType)]

# For each struct, get the list of field names inside it.
struct_columns = []
for struct_name in structs_in_r_df:
    struct_columns.append(r_df.select(f"{struct_name}.*").columns)

missing_columns = list(set(r_df.columns) - set(t_df.columns))  # in r_df but not in t_df
similar_columns = list(set(r_df.columns) & set(t_df.columns))  # in both
# Remove the struct columns from both lists so you don't represent them twice.

# You need to repeat the missing/intersection step above for the fields of each struct and then rebuild the structs,
# but the above gives you the idea of how to get the fields out.
# You can use string formatting, e.g. col(f"{struct_name}.{field}"), to get the values out of the fields.
ref_types = dict(r_df.dtypes)  # column name -> type string taken from r_df
result = r_df.unionByName(  # match columns by name, since the select below does not keep r_df's column order
    t_df.select(
        # Building the lists with comprehensions and passing them as varargs to select
        # pulls out the columns you need dynamically.
        *(
            [lit(None).cast(ref_types[c]).alias(c) for c in missing_columns]
            + [col(c).cast(ref_types[c]).alias(c) for c in similar_columns]
        )
    )
)

Once you have the union, here's a way to put the struct back together:

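from pyspark.sql.functions import col, struct

# Assumes the struct's fields were pulled out into top-level columns (e.g. "mail", "sms") before the union.
result = result.select(
    col("_id"),
    # struct_columns[0] holds the field names of the first struct column; pass them as varargs to struct().
    struct(*[col(column).alias(column) for column in struct_columns[0]]).alias("notificationsSend"),
    # recordingDetails is an array of structs, so struct() alone cannot rebuild it; it is passed through here.
    col("recordingDetails"),
)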
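
As an alternative to flattening and rebuilding by hand, the select expressions can be generated from the two schemas. The following is a minimal sketch rather than a tested solution, and it is not from the original answer: `ddl`, `align_expr`, and `select_exprs` are hypothetical helper names, the schemas are assumed to be in the dict form returned by `df.schema.jsonValue()`, and the generated strings rely on the Spark SQL functions `named_struct` and `transform` (the latter needs Spark 2.4 or later). Field names that need backtick quoting and map types are not handled, and a NULL struct in t_df comes out as a struct of NULL fields rather than NULL.

```python
def ddl(dtype):
    """Render a JSON-form type as a Spark SQL type string."""
    if isinstance(dtype, str):
        return dtype
    if dtype["type"] == "struct":
        inner = ",".join(f"{f['name']}:{ddl(f['type'])}" for f in dtype["fields"])
        return f"struct<{inner}>"
    if dtype["type"] == "array":
        return f"array<{ddl(dtype['elementType'])}>"
    raise ValueError(f"unsupported type in this sketch: {dtype['type']}")


def align_expr(ref, tgt, path, depth=0):
    """Build a SQL expression that reshapes `path` (typed `tgt`) into type `ref`."""
    ref_is_primitive = isinstance(ref, str)
    # Field absent from the target, or a different kind of type: typed NULL.
    if tgt is None or ref_is_primitive != isinstance(tgt, str) or (
        not ref_is_primitive and ref["type"] != tgt["type"]
    ):
        return f"CAST(NULL AS {ddl(ref)})"
    if ref_is_primitive:
        return f"CAST({path} AS {ref})"
    if ref["type"] == "struct":
        tgt_fields = {f["name"]: f["type"] for f in tgt["fields"]}
        # Only fields named in the reference are emitted, so extras are dropped.
        parts = [
            f"'{f['name']}', "
            + align_expr(f["type"], tgt_fields.get(f["name"]), f"{path}.{f['name']}", depth)
            for f in ref["fields"]
        ]
        return "named_struct(" + ", ".join(parts) + ")"
    if ref["type"] == "array":
        var = f"x{depth}"  # a distinct lambda variable per nesting level
        inner = align_expr(ref["elementType"], tgt["elementType"], var, depth + 1)
        return f"transform({path}, {var} -> {inner})"
    raise ValueError(f"unsupported type in this sketch: {ref['type']}")


def select_exprs(ref_schema, tgt_schema):
    """One expression per reference column, in the reference column order."""
    tgt_fields = {f["name"]: f["type"] for f in tgt_schema["fields"]}
    return [
        align_expr(f["type"], tgt_fields.get(f["name"]), f["name"]) + f" AS {f['name']}"
        for f in ref_schema["fields"]
    ]


# Intended usage (needs a running Spark session, so it is left as a comment):
# exprs = select_exprs(r_df.schema.jsonValue(), t_df.schema.jsonValue())
# aligned = t_df.selectExpr(*exprs)
# result = r_df.unionByName(aligned)
```

Because the expressions are driven entirely by r_df's schema, the same code should cover all 50+ columns without listing them by name, which is the main reason to prefer it over hand-written struct() calls.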
