
How to merge two dataframes in pyspark with different columns inside struct or array?

Let's say there are two dataframes: a reference dataframe and a target dataframe.

The reference DF defines the reference schema.

Schema for reference DF (r_df)

r_df.printSchema()
root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- mail: boolean (nullable = true)
 |    |-- sms: boolean (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)

However, the target dataframe's schema is dynamic.

Schema for target DF (t_df)

t_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- notificationsSend: struct (nullable = true)
 |    |-- sms: string (nullable = true)
 |-- recordingDetails: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- channelName: string (nullable = true)
 |    |    |-- fileLink: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- recorderId: string (nullable = true)
 |    |    |-- resourceId: string (nullable = true)
 |    |    |-- createdBy: string (nullable = true)

So we observe multiple changes in the target's schema.

  1. Structs and arrays in t_df can contain more or fewer fields than in r_df.
  2. The data type of a field can change too, so type casting is required (e.g. sms is boolean in r_df but string in t_df).
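
Both kinds of change can be listed mechanically before deciding how to fix them. The following is a minimal sketch, assuming the schemas are available in the plain-dict form returned by `df.schema.jsonValue()`; they are written out by hand here (and trimmed) so the snippet runs without a Spark session. The helper names `flatten`, `diff_schemas`, `field` and `struct_of` are made up for illustration, and array elements are marked with `[]` in the reported paths.

```python
def flatten(dtype, prefix=""):
    """Map each leaf path of a schema in jsonValue() form to its type name."""
    if isinstance(dtype, dict) and dtype["type"] == "struct":
        leaves = {}
        for f in dtype["fields"]:
            path = f"{prefix}.{f['name']}" if prefix else f["name"]
            leaves.update(flatten(f["type"], path))
        return leaves
    if isinstance(dtype, dict) and dtype["type"] == "array":
        return flatten(dtype["elementType"], prefix + "[]")
    # Primitive types are plain strings; other complex types (e.g. map) count as leaves.
    return {prefix: dtype if isinstance(dtype, str) else dtype["type"]}


def diff_schemas(ref, tgt):
    """Return (paths missing in tgt, paths extra in tgt, {path: (ref type, tgt type)})."""
    r, t = flatten(ref), flatten(tgt)
    missing = sorted(r.keys() - t.keys())
    extra = sorted(t.keys() - r.keys())
    changed = {p: (r[p], t[p]) for p in sorted(r.keys() & t.keys()) if r[p] != t[p]}
    return missing, extra, changed


# Hypothetical helpers that build the jsonValue() layout by hand.
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def struct_of(*fields):
    return {"type": "struct", "fields": list(fields)}


# Trimmed versions of the two schemas in the question.
r_schema = struct_of(
    field("_id", "string"),
    field("notificationsSend", struct_of(field("mail", "boolean"), field("sms", "boolean"))),
    field("recordingDetails", {"type": "array", "containsNull": True,
                               "elementType": struct_of(field("id", "string"))}),
)
t_schema = struct_of(
    field("_id", "string"),
    field("notificationsSend", struct_of(field("sms", "string"))),
    field("recordingDetails", {"type": "array", "containsNull": True,
                               "elementType": struct_of(field("id", "string"),
                                                        field("createdBy", "string"))}),
)

missing, extra, changed = diff_schemas(r_schema, t_schema)
print(missing)  # ['notificationsSend.mail']
print(extra)    # ['recordingDetails[].createdBy']
print(changed)  # {'notificationsSend.sms': ('boolean', 'string')}
```

With real dataframes the inputs would presumably be `r_df.schema.jsonValue()` and `t_df.schema.jsonValue()`.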

I was able to add and remove columns of non-struct types. However, structs and arrays are a real pain. Since there are 50+ columns, I need an optimised solution that works for all of them.

Any solution, opinion, or workaround would be really helpful.

Expected output: I want to make t_df's schema exactly the same as r_df's schema.

The code below is untested (written from memory), but it should show the approach. I'd be interested to hear other ideas for reading the fields out of a struct.

  1. Extract the struct column names and types.
  2. Find the columns that need to be dropped.
  3. Drop those columns.
  4. Rebuild the structs according to r_df.

from pyspark.sql.functions import col, lit, struct, transform, when  # transform() needs Spark 3.1+
from pyspark.sql.types import ArrayType, StructType

# Column name -> data type for both frames, taken straight from the schemas.
r_types = {field.name: field.dataType for field in r_df.schema.fields}
t_types = {field.name: field.dataType for field in t_df.schema.fields}

# Top-level columns that r_df has and t_df lacks.
missing_columns = set(r_df.columns) - set(t_df.columns)

def rebuild_struct(parent, r_struct, t_struct):
    # Rebuild a struct with exactly r_struct's fields, in r_df's order:
    # fields t_df also has are cast, missing fields become NULL, extra fields are dropped.
    # Only one level is handled; a nested struct whose own fields differ would need recursion.
    t_fields = set(t_struct.fieldNames())
    return struct(*[
        (parent[f.name] if f.name in t_fields else lit(None)).cast(f.dataType).alias(f.name)
        for f in r_struct.fields
    ])

Once the struct helper is in place, rebuild every column of t_df and union the result with r_df:

# Assumes a column that is a struct (or array) in r_df has the same kind in t_df.
columns = []
for name in r_df.columns:  # keep r_df's column order, because union() matches by position
    r_type = r_types[name]
    if name in missing_columns:
        column = lit(None).cast(r_type)
    elif isinstance(r_type, StructType):
        # when() without otherwise() keeps a NULL struct NULL instead of a struct of NULLs.
        column = when(col(name).isNotNull(), rebuild_struct(col(name), r_type, t_types[name]))
    elif isinstance(r_type, ArrayType) and isinstance(r_type.elementType, StructType):
        # An array of structs cannot be flattened with ".*", so rebuild it element by element.
        r_elem, t_elem = r_type.elementType, t_types[name].elementType
        column = transform(col(name), lambda x: rebuild_struct(x, r_elem, t_elem))
    else:
        column = col(name).cast(r_type)
    columns.append(column.alias(name))

# Columns that exist only in t_df are never selected, so they are dropped here.
result = r_df.union(t_df.select(*columns))
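
For schemas nested more than one level deep, the same idea can be applied recursively. The following is a sketch rather than a tested solution: it walks the reference schema in the plain-dict form returned by `df.schema.jsonValue()` and builds one Spark SQL expression string per column, using the built-in SQL functions `named_struct` and `transform`. The helper names (`type_string`, `conform`, `select_exprs`, `field`, `struct_of`) are made up for illustration. It assumes field names are plain identifiers that need no backtick quoting, and it supports only struct, array and primitive types.

```python
def type_string(dtype):
    """Render a type in jsonValue() form as a Spark SQL type string."""
    if isinstance(dtype, str):
        return dtype
    if dtype["type"] == "struct":
        inner = ",".join(f"{f['name']}:{type_string(f['type'])}" for f in dtype["fields"])
        return f"struct<{inner}>"
    if dtype["type"] == "array":
        return f"array<{type_string(dtype['elementType'])}>"
    raise ValueError(f"unsupported type: {dtype['type']}")


def kind(dtype):
    return dtype["type"] if isinstance(dtype, dict) else "primitive"


def conform(expr, ref, tgt, depth=0):
    """SQL expression reshaping `expr` (typed `tgt`, or None if absent) into `ref`."""
    if tgt is None:  # field missing in the target: fill with a typed NULL
        return f"CAST(NULL AS {type_string(ref)})"
    if kind(ref) == "struct" and kind(tgt) == "struct":
        tgt_fields = {f["name"]: f["type"] for f in tgt["fields"]}
        parts = []
        for f in ref["fields"]:  # only reference fields are emitted, so extras are dropped
            inner = conform(f"{expr}.{f['name']}", f["type"], tgt_fields.get(f["name"]), depth)
            parts.append(f"'{f['name']}', {inner}")
        return f"named_struct({', '.join(parts)})"
    if kind(ref) == "array" and kind(tgt) == "array":
        var = f"x{depth}"  # a distinct lambda variable per nesting level
        body = conform(var, ref["elementType"], tgt["elementType"], depth + 1)
        return f"transform({expr}, {var} -> {body})"
    return f"CAST({expr} AS {type_string(ref)})"


def select_exprs(ref_schema, tgt_schema):
    """One 'expression AS name' string per reference column, in reference order."""
    tgt_cols = {f["name"]: f["type"] for f in tgt_schema["fields"]}
    return [f"{conform(f['name'], f['type'], tgt_cols.get(f['name']))} AS {f['name']}"
            for f in ref_schema["fields"]]


# Hypothetical helpers that build the jsonValue() layout by hand.
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def struct_of(*fields):
    return {"type": "struct", "fields": list(fields)}


# Trimmed versions of the two schemas in the question.
r_schema = struct_of(
    field("_id", "string"),
    field("notificationsSend", struct_of(field("mail", "boolean"), field("sms", "boolean"))),
    field("recordingDetails", {"type": "array", "containsNull": True,
                               "elementType": struct_of(field("id", "string"))}),
)
t_schema = struct_of(
    field("_id", "string"),
    field("notificationsSend", struct_of(field("sms", "string"))),
    field("recordingDetails", {"type": "array", "containsNull": True,
                               "elementType": struct_of(field("id", "string"),
                                                        field("createdBy", "string"))}),
)

exprs = select_exprs(r_schema, t_schema)
for e in exprs:
    print(e)
# CAST(_id AS string) AS _id
# named_struct('mail', CAST(NULL AS boolean), 'sms', CAST(notificationsSend.sms AS boolean)) AS notificationsSend
# transform(recordingDetails, x0 -> named_struct('id', CAST(x0.id AS string))) AS recordingDetails
```

With real dataframes the strings would presumably be passed on as `t_df.selectExpr(*select_exprs(r_df.schema.jsonValue(), t_df.schema.jsonValue()))`. One caveat: `named_struct` turns a NULL struct into a struct of NULL fields, so a NULL check may be needed around it if that distinction matters.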

