Schema validation in Spark using Python

Question

from pyspark.sql.functions import when

def validate_schema(df, dic):
    # Flag each column whose data type differs from the expected type in dic
    df2 = df.withColumn('typ_freq', when(df.schema["Frequency"].dataType != dic["Frequency"], False).otherwise(True))
    df2 = df2.withColumn('typ_region', when(df.schema["Region"].dataType != dic["Region"], False).otherwise(True))
    df2.show()

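It gives me the error "condition should be a Column".

However, when I validate the length instead, for example df.withColumn("len_freq", when(length(df["Freq"]) > dic["Freq"], False).otherwise(True)), it works.

Can anyone explain why the data type check does not work, and how to fix it?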

Answer 1

For schema validation in Spark, I would recommend the Cerberus library (https://docs.python-cerberus.org/en/stable/). There is a great tutorial on using Cerberus with Spark: https://www.waitingforcode.com/apache-spark/validating-json-apache-spark-cerberus/read

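As for why the current solution does not work: when() expects a Column as its condition, but comparing two data types yields a plain Python bool. You need to convert the condition to work with Column types, possibly using the lit function (https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lit), something like: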

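import pyspark.sql.functions as F

# lit() cannot wrap a DataType object, so store its string form (e.g. "int")
df = df.withColumn("data_type", F.lit(df.schema["Frequency"].dataType.simpleString()))
# Assumes dic["Frequency"] is a DataType; the comparison now yields a Column
df = df.withColumn('typ_freq', F.when(F.col("data_type") != dic["Frequency"].simpleString(), False).otherwise(True))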
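
As a possible alternative: a column's data type is schema metadata that is identical for every row, so it can also be checked once in plain Python without when() at all. The sketch below is only an illustration under assumptions. find_type_mismatches is a hypothetical helper, and the two dicts are made-up stand-ins for dict(df.dtypes) (which maps each column name to a type string such as "int") and for the expected types, so that the example runs without a Spark session.

```python
from typing import Dict, Mapping


def find_type_mismatches(
    actual: Mapping[str, str], expected: Mapping[str, str]
) -> Dict[str, Dict[str, str]]:
    """Return {column: {"expected": ..., "actual": ...}} for each differing column.

    A column that is absent from `actual` is reported with the
    placeholder type "<missing>".
    """
    mismatches = {}
    for column, expected_type in expected.items():
        actual_type = actual.get(column, "<missing>")
        if actual_type != expected_type:
            mismatches[column] = {"expected": expected_type, "actual": actual_type}
    return mismatches


# With a real DataFrame, `actual_types` would be dict(df.dtypes).
# These literal values are assumptions made up for the illustration.
actual_types = {"Frequency": "string", "Region": "string"}
expected_types = {"Frequency": "int", "Region": "string"}

mismatches = find_type_mismatches(actual_types, expected_types)
print(mismatches)
```

If a per-row flag column is still wanted, the result for a column (a plain bool such as "Frequency" not in mismatches) can be wrapped in lit() before being passed to withColumn.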

Good luck!

Solution 1
1 2020-03-31 23:43:13
