Spark DataFrame returns null when trying to apply a schema to JSON data

Question

I'm using the Spark Java API to read a text file, convert it to JSON, and then apply a schema to it. The schema can vary based on a mapping table in the database, which is why I first convert the file to JSON: that way the schema mapping does not have to follow the column order. Here is what I've done:

// Defined the schema (basic representation)
StructType myschema = new StructType().add("a", DataTypes.StringType, true)
                                      .add("b", DataTypes.StringType, true)
                                      .add("x", DataTypes.StringType, true)
                                      .add("y", DataTypes.IntegerType, true)
                                      .add("z", DataTypes.BooleanType, true);

// Read a pipe-delimited text file and convert each row to a JSON string;
// the file has fewer columns than myschema
Dataset<String> data = spark.read()
                            .option("delimiter", "|")
                            .option("header", "true")
                            .csv(myFile)
                            .toJSON();

The code above produces something like this:

data.show(false);

|value|
+----------------------------------------+
|      {"x":"name1","z":"true","y":"1234"}|
|      {"x":"name2","z":"false","y":"1445"}|
|      {"x":"name3","z":"true","y":"1212"}|

My issue comes when I run this:

Dataset<Row> data_with_schema = spark.read().schema(myschema).json(data);

Because my result turns into this:

data_with_schema.show(false);
|x|y|z|
+-------+-------+-------+
|null  |null  |null  |
|null  |null  |null  |
|null  |null  |null  |

I read on Stack Overflow that this might be because I'm trying to cast JSON strings to integers. However, when I tried to declare the data variable as a Dataset<Row> instead of a Dataset<String>, I got an Incompatible Types error. I'm not sure what the real issue is or what the workaround would be.

Answer 1

Figured out the problem:

If the input file contains data that the schema cannot be applied to, Spark returns null for ALL the data in your table. For example, "1n" cannot be converted to an integer. If DataTypes.IntegerType is applied to a column that contains "1n", then the whole table will have null values.
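One way to locate values like "1n" before applying the schema is to check the raw strings against the target type. The following is a minimal sketch in plain Java, not Spark's own parsing logic; the class name SchemaPrecheck and its helper methods are hypothetical and only illustrate the idea for integer columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: flags raw values that cannot be read as a 32-bit integer.
final class SchemaPrecheck {

    // Returns true when the raw text parses as an int; null is accepted
    // because the columns in the question are declared nullable.
    static boolean fitsInteger(String raw) {
        if (raw == null) {
            return true;
        }
        try {
            Integer.parseInt(raw.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Collects every value that would not fit an integer column.
    static List<String> badIntegers(List<String> values) {
        List<String> bad = new ArrayList<>();
        for (String value : values) {
            if (!fitsInteger(value)) {
                bad.add(value);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // "1n" is the example value from the answer above.
        System.out.println(badIntegers(Arrays.asList("1234", "1n", "1445"))); // prints [1n]
    }
}
```

A check of this kind could be run on a sample of the file to see which columns contain values that do not match the mapped type.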

Answer 2

I think this is happening because of a data type mismatch between the JSON and the defined schema. For example, the JSON has an "age" attribute holding an integer, but the schema defines "age" as a String type. Because of that mismatch, all the data becomes null.
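In the question's data the mismatch runs the other way: the JSON produced by toJSON() holds "y" and "z" as quoted strings, while the schema declares them as IntegerType and BooleanType. One possible workaround, sketched here under assumptions, is to skip the JSON round trip and cast the CSV columns by name. The class CastByName, its method buildSelectExprs, and the name-to-type map are hypothetical; the map stands in for the database mapping table and uses Spark SQL type names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: builds one Spark SQL expression per schema field.
final class CastByName {

    // schema:  field name -> Spark SQL type name, in the desired output order
    // present: column names that actually exist in the file
    static List<String> buildSelectExprs(Map<String, String> schema, Set<String> present) {
        List<String> exprs = new ArrayList<>();
        for (Map.Entry<String, String> field : schema.entrySet()) {
            String name = field.getKey();
            // Columns missing from the file are filled with a typed NULL.
            String source = present.contains(name) ? "`" + name + "`" : "NULL";
            exprs.add("CAST(" + source + " AS " + field.getValue() + ") AS `" + name + "`");
        }
        return exprs;
    }

    public static void main(String[] args) {
        // Mirrors myschema from the question; the file only has x, y and z.
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("a", "string");
        schema.put("b", "string");
        schema.put("x", "string");
        schema.put("y", "int");
        schema.put("z", "boolean");
        Set<String> present = new HashSet<>(Arrays.asList("x", "z", "y"));

        for (String expr : buildSelectExprs(schema, present)) {
            System.out.println(expr);
        }
    }
}
```

The resulting list could then be passed to Dataset.selectExpr on the DataFrame returned by the csv(myFile) call, for example csv.selectExpr(exprs.toArray(new String[0])). Because columns are matched by name, the mapping does not depend on column order. With ANSI mode off, a value such as "1n" cast to int should become null in that one cell rather than affecting other columns or rows.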

2 answers

Answer 1
7 2019-11-23 23:19:58

Answer 2
0 2021-06-08 13:03:14
