In Palantir Foundry, can I find which CSV file is causing schema errors in a dataset?

I'm seeing errors like the following when building downstream of some datasets containing CSV files:

Caused by: java.lang.IllegalStateException: Header specifies 185 column types but line split into 174: "SUSPECT STRING","123...

or

Caused by: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: Exception parsing 'SUSPECT STRING' into a IntegerType$ for column "COLOUR_ID": Unable to deserialize value using com.palantir.spark.parsers.text.converters.IntegerConverter. The value being deserialized was: SUSPECT STRING

Looking at the errors, it seems to me like some of my CSV files have the wrong schema. How can I find which ones?

One technique you could use would be to:

  1. create a transform that reads the CSV files in as if they were unstructured text files, then
  2. filter the resulting DataFrame down to just the suspect rows, as identified by the extracts contained in the error message

Below is an example of such a transform:

from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Read each file as unstructured text (one row per line), keep only
    # the lines containing the suspect value quoted in the error message,
    # and record which file each surviving line came from.
    parsed_dfs = []
    for file_name in paths:
        parsed_df = (
            spark_session.read.text(file_name)
            .filter(F.col("value").contains(F.lit("SUSPECT STRING")))
            .withColumn("_filename", F.lit(file_name))
        )
        parsed_dfs += [parsed_df]
    # Combine the per-file DataFrames into a single output.
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    output_dataset=Output("my_output"),
    input_dataset=Input("my_input"),
)
def compute(ctx, input_dataset, output_dataset):
    session = ctx.spark_session
    # List every file in the input dataset and build its absolute path.
    input_filesystem = input_dataset.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    files = [hadoop_path + "/" + file_name.path for file_name in input_filesystem.ls()]
    output_df = read_files(session, files)
    output_dataset.write_dataframe(output_df)
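
(If the input dataset mixes CSV files with other formats, you may be able to narrow the listing, for example with input_filesystem.ls(glob="*.csv"); treat the glob argument as an assumption to verify against your version of the transforms API.)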

This would then output the rows of interest along with the paths to the files they're in.
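
For the first error, where a row simply has the wrong number of fields, a variation on the same transform is to count delimiters per line instead of searching for a literal value. Below is a minimal sketch of that idea; find_malformed_rows is a hypothetical drop-in replacement for read_files above, the expected column count (185) is taken from the error message, and the naive comma split is an assumption that will over-count when quoted fields contain embedded commas:

from pyspark.sql import functions as F
from transforms.verbs.dataframes import union_many

EXPECTED_COLUMNS = 185  # from "Header specifies 185 column types"


def find_malformed_rows(spark_session, paths):
    parsed_dfs = []
    for file_name in paths:
        parsed_df = (
            spark_session.read.text(file_name)
            # Naive field count: splits on every comma, so quoted fields
            # containing commas will over-count; good enough for triage.
            .withColumn("_column_count", F.size(F.split(F.col("value"), ",")))
            .filter(F.col("_column_count") != EXPECTED_COLUMNS)
            .withColumn("_filename", F.lit(file_name))
        )
        parsed_dfs += [parsed_df]
    return union_many(*parsed_dfs, how="wide")

Swapping this in for read_files in the compute function above would list the offending rows together with their field counts and source files.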
