Comparing schema of dataframe using Pyspark
I have a data frame (df1). To show its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Sometimes the schema changes (the column type or name):
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (sometimes a column can move to another position). The result should be a table (or data frame) something like this:
column   df1      df2      diff
name     string   array    type
gender   N/A      integer  new column
(The age column is the same and didn't change. If a column is omitted, there should be the indication 'omitted'.) How can I do this efficiently if each data frame has many columns?
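The requested table can be sketched in pure Python without touching Spark at all, since `dict(df.dtypes)` gives each schema as a name-to-type mapping. The literal dicts below are stand-ins for the two example schemas from the question; the `"N/A"`, `"omitted"`, `"new column"`, and `"type"` labels follow the table above:

```python
# Stand-ins for dict(df1.dtypes) and dict(df2.dtypes) on real DataFrames.
schema1 = {"name": "string", "age": "bigint"}
schema2 = {"name": "array<string>", "gender": "int", "age": "bigint"}

def schema_diff_rows(s1, s2):
    """Return (column, df1_type, df2_type, diff) rows for differing columns."""
    rows = []
    for col in sorted(set(s1) | set(s2)):
        t1, t2 = s1.get(col, "N/A"), s2.get(col, "N/A")
        if col not in s2:
            rows.append((col, t1, t2, "omitted"))       # dropped from df2
        elif col not in s1:
            rows.append((col, t1, t2, "new column"))    # added in df2
        elif t1 != t2:
            rows.append((col, t1, t2, "type"))          # type changed
        # identical columns (like age here) produce no row
    return rows

for row in schema_diff_rows(schema1, schema2):
    print(row)
```

Because it only iterates over the union of column names once, this stays cheap even with many columns.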
You can try creating two pandas dataframes with the metadata from both df1 and df2, like below:
import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])
and then join those two pandas dataframes with an 'outer' merge.
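That outer merge can be sketched as follows; `indicator=True` adds a `_merge` column recording which side each row came from. The toy dtype lists stand in for `df1.dtypes` / `df2.dtypes` (PySpark returns those as `(column, type)` tuples):

```python
import pandas as pd

# Stand-ins for df1.dtypes / df2.dtypes from the question's example.
dtypes1 = [("name", "string"), ("age", "bigint")]
dtypes2 = [("name", "array<string>"), ("gender", "int"), ("age", "bigint")]

pd_df1 = pd.DataFrame(dtypes1, columns=["column", "df1_type"])
pd_df2 = pd.DataFrame(dtypes2, columns=["column", "df2_type"])

# Outer merge keeps every column name from both schemas; indicator=True
# marks each row as 'both', 'left_only', or 'right_only'.
merged = pd_df1.merge(pd_df2, on="column", how="outer", indicator=True)

# Keep rows where the column is missing on one side or the type changed.
diff = merged[(merged["_merge"] != "both")
              | (merged["df1_type"] != merged["df2_type"])]
print(diff)
```

Here `age` drops out (same type on both sides), while `name` (type change) and `gender` (only in df2) remain.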
Without any external library, we can find the schema difference using:
from pyspark.sql import DataFrame, SparkSession


def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    # Each entry of df.dtypes is a (column name, type) tuple.
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        s1.join(s2, s1.d1_name == s2.d2_name, how="outer")
        # Keep columns missing from one side, or present on both sides
        # with a different type.
        .where(
            s1.d1_type.isNull()
            | s2.d2_type.isNull()
            | (s1.d1_type != s2.d2_type)
        )
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference
Usage:
diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)
A custom function that could be useful for someone:
def SchemaDiff(DF1, DF2):
    # Getting the schema of both dataframes as a dictionary
    DF1Schema = {x[0]: x[1] for x in DF1.dtypes}
    DF2Schema = {x[0]: x[1] for x in DF2.dtypes}

    # Columns present in DF1 but not in DF2
    DF1MinusDF2 = dict.fromkeys(set(DF1.columns) - set(DF2.columns), '')
    for column_name in DF1MinusDF2:
        DF1MinusDF2[column_name] = DF1Schema[column_name]

    # Columns present in DF2 but not in DF1
    DF2MinusDF1 = dict.fromkeys(set(DF2.columns) - set(DF1.columns), '')
    for column_name in DF2MinusDF1:
        DF2MinusDF1[column_name] = DF2Schema[column_name]

    # Find data types changed in DF2 as compared to DF1
    UpdatedDF1Schema = {k: v for k, v in DF1Schema.items() if k not in DF1MinusDF2}
    UpdatedDF1Schema = {**UpdatedDF1Schema, **DF2MinusDF1}
    DF1DataTypesChanged = {}
    for column_name in UpdatedDF1Schema:
        if UpdatedDF1Schema[column_name] != DF2Schema[column_name]:
            DF1DataTypesChanged[column_name] = DF2Schema[column_name]

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
To check whether the two schemas are identical, you can simply use:
df1.schema == df2.schema
(Note that comparing df1.printSchema() == df2.printSchema() would not work: printSchema() only prints the schema and returns None, so that comparison is always True.)