In Palantir Foundry, can I find which CSV file is causing schema errors in a dataset?
I'm seeing errors like the following when building downstream of some datasets containing CSV files:
Caused by: java.lang.IllegalStateException: Header specifies 185 column types but line split into 174: "SUSPECT STRING","123...
or
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: Exception parsing 'SUSPECT STRING' into a IntegerType$ for column "COLOUR_ID": Unable to deserialize value using com.palantir.spark.parsers.text.converters.IntegerConverter. The value being deserialized was: SUSPECT STRING
Looking at the errors, it seems to me that some of my CSV files have the wrong schema. How can I find out which ones?
One technique you could use would be to:

- get the paths of all the files backing the dataset
- parse each file as unstructured text, one row per line
- filter for rows containing the suspect string quoted in the error message
- tag each matching row with the path of the file it came from
Below is an example of such a transform:
from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from transforms.verbs.dataframes import union_many


def read_files(spark_session, paths):
    # Read each file as unstructured text (one row per line), keep only the
    # lines containing the suspect string, and tag each line with its file.
    parsed_dfs = []
    for file_name in paths:
        parsed_df = (
            spark_session.read.text(file_name)
            .filter(F.col("value").contains(F.lit("SUSPECT STRING")))
            .withColumn("_filename", F.lit(file_name))
        )
        parsed_dfs.append(parsed_df)
    output_df = union_many(*parsed_dfs, how="wide")
    return output_df


@transform(
    output_dataset=Output("my_output"),
    input_dataset=Input("my_input"),
)
def compute(ctx, input_dataset, output_dataset):
    session = ctx.spark_session
    input_filesystem = input_dataset.filesystem()
    hadoop_path = input_filesystem.hadoop_path
    # Build the full path of every file backing the input dataset.
    files = [hadoop_path + "/" + file_status.path for file_status in input_filesystem.ls()]
    output_df = read_files(session, files)
    output_dataset.write_dataframe(output_df)
This would then output the rows of interest (in the value column) along with the paths of the files they came from (in the _filename column).
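The second error names a specific suspect value, but the first one (a column-count mismatch) may not. The same text-reading trick can be adapted to flag lines that do not split into the expected number of fields. Below is a minimal sketch under two assumptions taken from the error above: the files are comma-delimited and the header expects 185 columns. find_bad_rows and its parameters are hypothetical names, not part of any Foundry API:

from pyspark.sql import functions as F
from transforms.verbs.dataframes import union_many


def find_bad_rows(spark_session, paths, expected_columns=185, delimiter=","):
    # Assumed delimiter and column count are taken from the error message.
    # NOTE: a naive split ignores CSV quoting, so lines with quoted
    # delimiters will over-count fields; treat hits as leads, not proof.
    flagged_dfs = []
    for file_name in paths:
        flagged = (
            spark_session.read.text(file_name)
            .withColumn("_num_fields", F.size(F.split(F.col("value"), delimiter)))
            .filter(F.col("_num_fields") != expected_columns)
            .withColumn("_filename", F.lit(file_name))
        )
        flagged_dfs.append(flagged)
    # Union all per-file results into one DataFrame, as in the transform above.
    return union_many(*flagged_dfs, how="wide")

You could swap this function for read_files in the transform above to hunt down files whose rows split into 174 fields instead of 185.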