
Spark: Ignoring or handling DataSet select errors

I'm testing a prototype application. We have JSON data with nested fields, and I'm trying to pull out one field using the following JSON and code:

{"Feed": {"name": "test", "Record": [{"id": 1, "AllColumns": {"ColA": "1", "ColB": "2"}}, ...]}}

Dataset<Row> completeRecord = sparkSession.read().json(inputPath);
final Dataset<Row> feed = completeRecord.select(completeRecord.col("Feed.Record.AllColumns"));

I have around 2000 files with such records. I have tested some files individually and they work fine, but for some files I get the following error on the second line:

org.apache.spark.sql.AnalysisException: Can't extract value from Feed#8.Record: need struct type but got string;

I'm not sure what is going on here, but I would like to handle this error gracefully and log which file contains that record. Also, is there any way to ignore this and continue with the rest of the files?

Answering my own question based on what I have learned. There are a couple of ways to solve it: Spark provides options to ignore corrupt files and corrupt records.

To ignore corrupt files, one can set the following flag to true:

spark.sql.files.ignoreCorruptFiles=true
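For example (a minimal sketch, reusing the sparkSession and inputPath names from the question), the flag can be set on the running session before the files are read:

sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");
Dataset<Row> completeRecord = sparkSession.read().json(inputPath);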

For more fine-grained control, and to ignore bad records instead of the complete file, you can use one of the three modes that the Spark API provides.

According to the DataFrameReader API:

mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by the user, it sets null for extra fields.
DROPMALFORMED: ignores the whole corrupted records.
FAILFAST: throws an exception when it meets corrupted records.

PERMISSIVE mode worked really well for me, but when I provided my own schema, Spark filled missing attributes with null instead of marking them as corrupt records.
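As an illustration (a sketch only; inputPath is the variable from the question, and _corrupt_record is the default name Spark uses for the corrupt-record column), the mode can be set directly on the reader:

Dataset<Row> completeRecord = sparkSession.read()
    .option("mode", "PERMISSIVE")                            // keep malformed records, null out other fields
    .option("columnNameOfCorruptRecord", "_corrupt_record")  // raw text of malformed records goes here
    .json(inputPath);
// Alternatively, .option("mode", "DROPMALFORMED") drops the bad records entirely.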

The exception says that one of the JSON files differs in its structure and that the path Feed.Record.AllColumns does not exist in this specific file.

Based on this method,

// Returns true if the given column path can be resolved on the DataFrame;
// df.apply(path) throws an AnalysisException when the path does not exist.
private boolean pathExists(Dataset<Row> df, String path) {
  try {
    df.apply(path);
    return true;
  }
  catch (Exception ex) {
    return false;
  }
}

you can decide whether to execute the select or log an error message:

if (pathExists(completeRecord, "Feed.Record.AllColumns")) {
  final Dataset<Row> feed = completeRecord.select(completeRecord.col("Feed.Record.AllColumns"));
  //continue with processing
}
else {
  //log error message
}
