
How to merge two dataframes in PySpark with different columns inside a struct or array?

Let's say there are two dataframes: a reference dataframe and a target dataframe.

The reference DF defines the reference schema.

Schema for the reference DF (r_df):

r_df.printSchema()
root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- mail: boolean (nullable = true)
 |    |-- sms: boolean (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)

However, the target dataframe's schema is dynamic in nature.

Schema for the target DF (t_df):

t_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- sms: string (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)
 |    |    |-- createdBy: string (nullable = true)

So we observe multiple changes in the target's schema.

  1. Structs or arrays inside t_df can have more or fewer columns.
  2. The datatype of a column can change too, so type casting is required. (E.g. the sms column is boolean in r_df but string in t_df.)

I was able to add/remove columns that are of non-struct datatype. However, structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.

Any solution, opinion, or workaround would be really helpful.

Expected output: I want to make t_df's schema exactly the same as r_df's schema.

The code below is untested (written from memory) but should outline how to do it. There may be a way to get the fields from a struct directly, but I'm not aware of one, so I'm interested to hear others' ideas.
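On getting the fields out of a struct: PySpark exposes them through the schema (`r_df.schema.fields`, and each struct's own `dataType.fields`), and `r_df.schema.jsonValue()` returns the same tree as plain dicts. Below is a minimal sketch, assuming the two schemas from the question are available in that dict form. The dicts are hand-written stand-ins so the example runs without a Spark session, and `struct_type`, `array_type` and `leaf_types` are hypothetical helper names, not PySpark APIs.

```python
def struct_type(*fields):
    """Build a struct in the dict form used by StructType.jsonValue()."""
    return {"type": "struct", "fields": [
        {"name": name, "type": dtype, "nullable": True, "metadata": {}}
        for name, dtype in fields
    ]}


def array_type(element):
    """Build an array type in the same dict form."""
    return {"type": "array", "elementType": element, "containsNull": True}


recording = [("channelName", "string"), ("fileLink", "string"), ("id", "string"),
             ("recorderId", "string"), ("resourceId", "string")]

# Hand-written stand-ins for r_df.schema.jsonValue() and t_df.schema.jsonValue().
r_schema = struct_type(
    ("_id", "string"),
    ("notificationsSend", struct_type(("mail", "boolean"), ("sms", "boolean"))),
    ("recordingDetails", array_type(struct_type(*recording))),
)
t_schema = struct_type(
    ("_id", "string"),
    ("notificationsSend", struct_type(("sms", "string"))),
    ("recordingDetails", array_type(struct_type(*recording, ("createdBy", "string")))),
)


def leaf_types(dtype, prefix=""):
    """Map every leaf field path to its type name, descending into structs and arrays."""
    if isinstance(dtype, dict) and dtype["type"] == "struct":
        leaves = {}
        for field in dtype["fields"]:
            path = f"{prefix}.{field['name']}" if prefix else field["name"]
            leaves.update(leaf_types(field["type"], path))
        return leaves
    if isinstance(dtype, dict) and dtype["type"] == "array":
        # "[]" marks that the remaining path applies to each array element.
        return leaf_types(dtype["elementType"], prefix + "[]")
    # Primitive types are plain strings; other complex types (e.g. map) are kept whole.
    return {prefix: dtype if isinstance(dtype, str) else dtype["type"]}


r_leaves, t_leaves = leaf_types(r_schema), leaf_types(t_schema)
missing = sorted(r_leaves.keys() - t_leaves.keys())   # in r_df only: add as nulls
extra = sorted(t_leaves.keys() - r_leaves.keys())     # in t_df only: drop
changed = {path: (r_leaves[path], t_leaves[path])     # in both, but types differ: cast
           for path in r_leaves.keys() & t_leaves.keys()
           if r_leaves[path] != t_leaves[path]}
print(missing, extra, changed)
```

With real dataframes, `r_df.schema.jsonValue()` and `t_df.schema.jsonValue()` would replace the hand-written dicts. For the schemas in the question, `missing` holds `notificationsSend.mail`, `extra` holds `recordingDetails[].createdBy`, and `changed` flags `notificationsSend.sms` as boolean in r_df versus string in t_df.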

  1. Extract the struct column names and types.
  2. Find the columns that need to be dropped.
  3. Drop those columns.
  4. Rebuild the structs according to r_df.

from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType

# Use a list comprehension to collect the names of r_df's top-level struct columns.
structs_in_r_df = [
    field.name for field in r_df.schema.fields
    if isinstance(field.dataType, StructType)
]

# Get the list of field names inside each struct.
struct_columns = {
    name: r_df.select(f"{name}.*").columns for name in structs_in_r_df
}

# Compare the top-level columns. The structs are left out of both lists so they
# are not represented twice; they have to be handled field by field instead.
plain_columns = [c for c in r_df.columns if c not in structs_in_r_df]
missing_columns = [c for c in plain_columns if c not in t_df.columns]
similar_columns = [c for c in plain_columns if c in t_df.columns]

# Repeat the same missing/similar comparison for the fields of each struct and
# then rebuild the structs; col(f"{struct_name}.{field}") reads a nested field.
# Note: an array of structs (recordingDetails here) only casts cleanly if its
# element fields already match r_df; otherwise it needs the same treatment.
r_types = dict(r_df.dtypes)
result = r_df.select(*plain_columns).union(
    t_df.select(*[
        # Shared columns are cast to r_df's type; missing ones become typed nulls.
        (col(c) if c in similar_columns else lit(None)).cast(r_types[c]).alias(c)
        for c in plain_columns  # union() matches by position, so keep r_df's order
    ])
)

Once you have the union, here's a way to put the struct back together:

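from pyspark.sql.functions import col, struct

# Assumes the fields of notificationsSend were pulled out into top-level
# columns before the union, so they can be regrouped into a struct here.
notification_fields = ["mail", "sms"]  # the struct's fields in r_df
result = result.select(
    col("_id"),
    # Pass the field columns as varargs to struct() to reconstitute the struct.
    struct(*[col(field).alias(field) for field in notification_fields])
        .alias("notificationsSend"),
    # recordingDetails is an array of structs, so struct() alone cannot rebuild
    # it; its elements need the same treatment applied one by one.
    col("recordingDetails"),
)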
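
For structs and arrays of structs, one possible way to finish the job is to generate the reshaping expressions from the two schemas instead of writing them by hand. The sketch below is a variation on the steps above, not the code from this answer: it walks the dict form returned by `schema.jsonValue()` and emits Spark SQL strings built from `named_struct`, `transform` and `CAST`. The names `struct_type`, `array_type`, `ddl`, `conform` and `select_exprs` are hypothetical helpers, the schemas are hand-written stand-ins, and the generated strings have not been run against a cluster here. It also assumes simple field names and that the primitive type names from `jsonValue()` (such as `string` or `boolean`) are accepted inside `CAST`; maps and other complex types are rejected with an error.

```python
def struct_type(*fields):
    """Build a struct in the dict form used by StructType.jsonValue()."""
    return {"type": "struct", "fields": [
        {"name": n, "type": t, "nullable": True, "metadata": {}} for n, t in fields]}


def array_type(element):
    return {"type": "array", "elementType": element, "containsNull": True}


def ddl(dtype):
    """Render a jsonValue()-style type as a DDL string usable inside CAST."""
    if isinstance(dtype, str):
        return dtype
    if dtype["type"] == "struct":
        inner = ",".join(f"`{f['name']}`:{ddl(f['type'])}" for f in dtype["fields"])
        return f"struct<{inner}>"
    if dtype["type"] == "array":
        return f"array<{ddl(dtype['elementType'])}>"
    raise ValueError(f"unsupported type: {dtype['type']}")


def conform(expr, ref, tgt, depth=0):
    """Return a SQL expression reshaping `expr` (typed `tgt`) into type `ref`."""
    ref_kind = ref["type"] if isinstance(ref, dict) else "leaf"
    tgt_kind = tgt["type"] if isinstance(tgt, dict) else "leaf"
    if ref_kind != tgt_kind:
        raise ValueError(f"cannot reshape {tgt_kind} into {ref_kind} at {expr}")
    if ref_kind == "leaf":
        return f"CAST({expr} AS {ref})"
    if ref_kind == "struct":
        tgt_fields = {f["name"]: f["type"] for f in tgt["fields"]}
        parts = []
        for f in ref["fields"]:  # fields only present in tgt are simply not emitted
            name = f["name"]
            if name in tgt_fields:
                value = conform(f"{expr}.`{name}`", f["type"], tgt_fields[name], depth)
            else:
                value = f"CAST(NULL AS {ddl(f['type'])})"  # missing field: typed null
            parts.append(f"'{name}', {value}")
        # CASE WHEN keeps a null struct null instead of a struct of nulls.
        return f"CASE WHEN {expr} IS NOT NULL THEN named_struct({', '.join(parts)}) END"
    if ref_kind == "array":
        var = f"x{depth}"  # a distinct lambda variable per nesting level
        inner = conform(var, ref["elementType"], tgt["elementType"], depth + 1)
        return f"transform({expr}, {var} -> {inner})"
    raise ValueError(f"unsupported type: {ref_kind}")


def select_exprs(ref_schema, tgt_schema):
    """One aliased expression per top-level column of the reference schema."""
    tgt_fields = {f["name"]: f["type"] for f in tgt_schema["fields"]}
    exprs = []
    for f in ref_schema["fields"]:
        name = f["name"]
        if name in tgt_fields:
            value = conform(f"`{name}`", f["type"], tgt_fields[name])
        else:
            value = f"CAST(NULL AS {ddl(f['type'])})"
        exprs.append(f"{value} AS `{name}`")
    return exprs


recording = [("channelName", "string"), ("fileLink", "string"), ("id", "string"),
             ("recorderId", "string"), ("resourceId", "string")]
r_schema = struct_type(
    ("_id", "string"),
    ("notificationsSend", struct_type(("mail", "boolean"), ("sms", "boolean"))),
    ("recordingDetails", array_type(struct_type(*recording))))
t_schema = struct_type(
    ("_id", "string"),
    ("notificationsSend", struct_type(("sms", "string"))),
    ("recordingDetails", array_type(struct_type(*recording, ("createdBy", "string")))))

exprs = select_exprs(r_schema, t_schema)
for e in exprs:
    print(e)
```

With real dataframes the intended usage would be `t_df.selectExpr(*select_exprs(r_df.schema.jsonValue(), t_df.schema.jsonValue()))`, which should give t_df the same columns, nesting and types as r_df. Generating the expressions from the schema keeps the approach workable for 50+ columns, since nothing is spelled out per column.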
