How do I check a column always has the same value in Palantir Foundry?
I have a dataset with a column that I expect to always have the same value. It represents the data schema version, so I can be sure I am deserializing my data correctly. How do I make sure I get alerted if a value with a different schema version shows up?
| data | version |
|---|---|
| {"key": "value"} | 1 |
| {"key": "value2"} | 1 |

(If any row has a version != 1, I want to be alerted.)
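Outside of Spark, the intent of the check can be sketched in plain Python over the parsed rows. This is a minimal illustration only, not the Foundry API; `assert_version` is a hypothetical helper name:

```python
import json

def assert_version(rows, expected=1):
    """Raise if any row's schema version differs from the expected one,
    then deserialize the data column, which is now known to be safe."""
    for row in rows:
        if row["version"] != expected:
            raise ValueError(f"unexpected schema version: {row['version']!r}")
    return [json.loads(row["data"]) for row in rows]

rows = [
    {"data": '{"key": "value"}', "version": 1},
    {"data": '{"key": "value2"}', "version": 1},
]
print(assert_version(rows))  # parses cleanly because every version is 1
```

A row with any other version would raise immediately instead of silently deserializing with the wrong schema, which is the alerting behavior the question asks for.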
You can do this with Spark's assert_true function. However, if you apply it to a column that is later dropped, Spark will "optimize" the expectation away, so coalesce it with a column that contributes to the output.
For example:
```python
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F, types as T


@transform_df(
    Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(source_df):
    return (
        source_df
        # assert the version is consistent; coalesce with "data" so the
        # expression is not optimized away when "version" is dropped
        .withColumn("data", F.coalesce(F.expr("assert_true(version = '1')"), "data"))
        .drop("version")
        # parse the data, knowing we can expect the correct schema
        .withColumn("data", F.from_json(F.col("data"), T.StructType([
            T.StructField("key", T.StringType())
        ])))
    )
```
If verifying the version at the start or end of the transform fits your needs, you can use data expectations instead. For example:
```python
from transforms.api import transform_df, Input, Output, Check
from pyspark.sql import functions as F, types as T
from transforms import expectations as E


@transform_df(
    Output("/path/to/output"),
    source_df=Input("/path/to/input", checks=[
        # assert the version is consistent
        # usually you'd put checks on the output dataset,
        # so this might be better placed where the input is created
        Check(E.col("version").equals(1), "version: equals 1"),
    ]),
)
def compute(source_df):
    return (
        source_df
        .drop("version")
        # parse the data, knowing we can expect the correct schema
        .withColumn("data", F.from_json(F.col("data"), T.StructType([
            T.StructField("key", T.StringType())
        ])))
    )
```
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to republish, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.