
Comparing schema of dataframe using Pyspark

I have a data frame (df). To show its schema I use:

from pyspark.sql.functions import *
df1.printSchema()

And I get the following result:

#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)

Sometimes the schema changes (the column type or name):

df2.printSchema()


#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)

I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (sometimes a column can move to another position). The result should be a table (or data frame) something like this:

column      df1       df2        diff
name        string    array      type
gender      N/A       integer    new column

(The age column is the same and didn't change. If a column is omitted, the indication should be 'omitted'.) How can I do this efficiently if each data frame has many columns?

You can try creating two pandas dataframes with the schema metadata from df1 and df2, like below:

import pandas as pd

# df.dtypes is a list of (column_name, data_type) tuples
pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])

and then join those two pandas dataframes with an 'outer' merge, as sketched below.
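A minimal sketch of that idea, producing roughly the table the question asks for (the renamed columns and the diff_label helper are illustrative, not part of the original suggestion):

import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'df1'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'df2'])
merged = pd_df1.merge(pd_df2, on='column', how='outer')

def diff_label(row):
    # classify each row the way the question's expected output does
    if pd.isna(row['df1']):
        return 'new column'
    if pd.isna(row['df2']):
        return 'omitted'
    return 'type' if row['df1'] != row['df2'] else ''

merged['diff'] = merged.apply(diff_label, axis=1)
# keep only the rows where something actually differs
print(merged[merged['diff'] != ''].fillna('N/A'))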

Without any external library, we can find the schema difference using:

from pyspark.sql import DataFrame, SparkSession


def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame) -> DataFrame:
    # each dataframe's dtypes is a list of (name, type) tuples
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        s1.join(s2, s1.d1_name == s2.d2_name, how="outer")
        # keep columns missing on either side, plus columns whose type changed
        .where(
            s1.d1_type.isNull()
            | s2.d2_type.isNull()
            | (s1.d1_type != s2.d2_type)
        )
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference

  • fillna is optional; I prefer to view the missing values as empty strings.
  • in the where clause we compare the types, so a column that exists in both dataframes but with a different type is also shown.
  • this will also show all columns that are in the second dataframe but not in the first.

Usage:

diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)

A custom function that could be useful for someone:

def SchemaDiff(DF1, DF2):
    # schemas of both dataframes as {column_name: data_type} dictionaries
    DF1Schema = {name: dtype for name, dtype in DF1.dtypes}
    DF2Schema = {name: dtype for name, dtype in DF2.dtypes}

    # columns present in DF1 but not in DF2
    DF1MinusDF2 = {name: DF1Schema[name]
                   for name in set(DF1.columns) - set(DF2.columns)}

    # columns present in DF2 but not in DF1
    DF2MinusDF1 = {name: DF2Schema[name]
                   for name in set(DF2.columns) - set(DF1.columns)}

    # columns present in both whose data type changed; the value is the DF2 type
    DF1DataTypesChanged = {name: DF2Schema[name]
                           for name in set(DF1.columns) & set(DF2.columns)
                           if DF1Schema[name] != DF2Schema[name]}

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
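A usage sketch with the question's dataframes (the variable names are illustrative):

only_in_df1, only_in_df2, type_changes = SchemaDiff(df1, df2)
print(only_in_df1)   # {} - every df1 column also exists in df2
print(only_in_df2)   # {'gender': 'int'}
print(type_changes)  # {'name': 'array<string>'}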

You can simply compare the schema objects directly (note that printSchema() only prints the schema and returns None, so comparing its return values would always be True):

df1.schema == df2.schema
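If you want field-level detail rather than a single boolean, a small sketch using the schema's StructField objects:

# (name, dataType) pairs that are missing or different on the other side
fields1 = {(f.name, f.dataType) for f in df1.schema.fields}
fields2 = {(f.name, f.dataType) for f in df2.schema.fields}
print(fields1 - fields2)  # in df1 but not identical in df2
print(fields2 - fields1)  # in df2 but not identical in df1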
