
Comparing schema of dataframe using Pyspark

I have a data frame (df1). To show its schema I use:

from pyspark.sql.functions import *
df1.printSchema()

And I get the following result:

#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)

Sometimes the schema changes (the column type or name):

df2.printSchema()


#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)

I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (sometimes a column can also move to another position). The result should be a table (or data frame) something like this:

column      df1       df2       diff
name        string    array     type
gender      N/A       integer   new column

(The age column is the same and didn't change. If a column is omitted, there will be an indication 'omitted'.) How can I do this efficiently if each data frame has many columns?

You can try creating two pandas dataframes with the metadata (column names and types) from both df1 and df2, like below,

import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])

and then combine those two pandas dataframes with an 'outer' merge, for example as in the sketch below.
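A minimal sketch of that merge, reusing pd_df1 and pd_df2 from above (the _df1/_df2 suffixes and the final filter are my own additions, not part of the original suggestion):

merged = pd_df1.merge(pd_df2, on='column', how='outer', suffixes=('_df1', '_df2'))

# keep rows where a column is missing on one side (NaN) or the two types differ
diff = merged[merged['data_type_df1'] != merged['data_type_df2']]
print(diff)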

Without any external library, we can find the schema difference using


from pyspark.sql import DataFrame, SparkSession


def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    # Build small (column name, column type) dataframes from each input's dtypes
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        # join on name *and* type so a column present in both dataframes
        # with a different type still shows up as a difference
        s1.join(s2, (s1.d1_name == s2.d2_name) & (s1.d1_type == s2.d2_type), how="outer")
        .where(s1.d1_type.isNull() | s2.d2_type.isNull())
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference

  • fillna is optional; I prefer to view missing values as empty strings.
  • The join uses the type as well as the name, so a column that exists in both dataframes but with different types still shows up in the result.
  • This will also show all columns that are in the second dataframe but not in the first.

Usage:

diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)

A custom function that could be useful for someone.

def SchemaDiff(DF1, DF2):
    # Get the schema of both dataframes as {column name: data type} dictionaries
    DF1Schema = {x[0]: x[1] for x in DF1.dtypes}
    DF2Schema = {x[0]: x[1] for x in DF2.dtypes}

    # Columns present in DF1 but not in DF2
    DF1MinusDF2 = dict.fromkeys((set(DF1.columns) - set(DF2.columns)), '')
    for column_name in DF1MinusDF2:
        DF1MinusDF2[column_name] = DF1Schema[column_name]

    # Columns present in DF2 but not in DF1
    DF2MinusDF1 = dict.fromkeys((set(DF2.columns) - set(DF1.columns)), '')
    for column_name in DF2MinusDF1:
        DF2MinusDF1[column_name] = DF2Schema[column_name]

    # Find data types that changed in DF1 as compared to DF2
    UpdatedDF1Schema = {k: v for k, v in DF1Schema.items() if k not in DF1MinusDF2}
    UpdatedDF1Schema = {**UpdatedDF1Schema, **DF2MinusDF1}
    DF1DataTypesChanged = {}
    for column_name in UpdatedDF1Schema:
        if UpdatedDF1Schema[column_name] != DF2Schema[column_name]:
            DF1DataTypesChanged[column_name] = DF2Schema[column_name]

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
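
A minimal usage sketch (assuming df1 and df2 are the dataframes from the question):

only_in_df1, only_in_df2, type_changes = SchemaDiff(df1, df2)
print(only_in_df1)    # columns (and their types) present only in df1
print(only_in_df2)    # columns (and their types) present only in df2
print(type_changes)   # common columns whose type in df2 differs from df1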

You can simply use

df1.schema == df2.schema

which returns True only when the two schemas are identical (printSchema() just prints the schema and returns None, so its results cannot be compared).
