In Palantir Foundry, can I find which CSV file is causing schema errors in a dataset?
I'm seeing errors like the following when building downstream of some datasets containing CSV files:
Caused by: java.lang.IllegalStateException: Header specifies 185 column types but line split into 174: "SUSPECT STRING","123...
or
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: Exception parsing 'SUSPECT STRING' into a IntegerType$ for column "COLOUR_ID": Unable to deserialize value using com.palantir.spark.parsers.text.converters.IntegerConverter. The value being deserialized was: SUSPECT STRING
Looking at the errors, it seems to me that some of my CSV files have the wrong schema. How can I find out which ones?
One technique you could use would be to:

- get the paths of all the files backing the dataset
- parse each file as unstructured text, one row per line
- filter for rows containing the suspect string quoted in the error message
- tag each matching row with the path of the file it came from
Below is an example of such a transform:
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Read each file as unstructured text (one row per line), keep only the
    # lines containing the suspect string, and tag each line with its file.
    parsed_dfs = []
    for file_name in paths:
        parsed_df = (
            spark_session.read.text(file_name)
            .filter(F.col("value").contains(F.lit("SUSPECT STRING")))
            .withColumn("_filename", F.lit(file_name))
        )
        parsed_dfs.append(parsed_df)
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    output_dataset=Output("my_output"),
    input_dataset=Input("my_input"),
)
def compute(ctx, input_dataset, output_dataset):
    session = ctx.spark_session
    input_filesystem = input_dataset.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    # Build the full path of every file backing the input dataset.
    files = [hadoop_path + "/" + file_status.path for file_status in input_filesystem.ls()]
    output_df = read_files(session, files)
    output_dataset.write_dataframe(output_df)
This would then output the rows of interest (in the value column) along with the paths of the files they came from (in the _filename column).
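The second error names a specific suspect value, but the first one (a column-count mismatch) may not. The same text-reading trick can be adapted to flag lines that do not split into the expected number of fields. Below is a minimal sketch under two assumptions taken from the error above: the files are comma-delimited and the header expects 185 columns. find_bad_rows and its parameters are hypothetical names, not part of any Foundry API:

from pyspark.sql import functions as F
from transforms.verbs.dataframes import union_many


def find_bad_rows(spark_session, paths, expected_columns=185, delimiter=","):
    # Assumed delimiter and column count are taken from the error message.
    # NOTE: a naive split ignores CSV quoting, so lines with quoted
    # delimiters will over-count fields; treat hits as leads, not proof.
    flagged_dfs = []
    for file_name in paths:
        flagged = (
            spark_session.read.text(file_name)
            .withColumn("_num_fields", F.size(F.split(F.col("value"), delimiter)))
            .filter(F.col("_num_fields") != expected_columns)
            .withColumn("_filename", F.lit(file_name))
        )
        flagged_dfs.append(flagged)
    # Union all per-file results into one DataFrame, as in the transform above.
    return union_many(*flagged_dfs, how="wide")

You could swap this function for read_files in the transform above to hunt down files whose rows split into 174 fields instead of 185.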