
Spark from_json No Exception

I am working with Spark 2.1 (Scala 2.11).

I want to load JSON-formatted strings with a defined schema from one dataframe into another dataframe. I have tried a few solutions, but the least expensive turns out to be the standard column function from_json. I tried an example ( https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-collection.html#from_json ) with this function, and it is giving me unexpected results.

val df = spark.read.text("testFile.txt")

df.show(false)

+----------------+
|value           |
+----------------+
|{"a": 1, "b": 2}|
|{bad-record     |
+----------------+


df.select(from_json(col("value"),
      StructType(List(
                  StructField("a",IntegerType),
                  StructField("b",IntegerType)
                ))
    )).show(false)


+-------------------+
|jsontostruct(value)|
+-------------------+
|[1,2]              |
|null               |
+-------------------+

This behavior is similar to mode: PERMISSIVE, which is not the default. By default, it is set to FAILFAST mode, meaning it should throw an exception whenever the input data and the enforced schema do not match.

I tried loading testFile.txt with DataFrameReader (JSON data source and FAILFAST mode) and successfully caught an exception.

spark.read.option("mode","FAILFAST").json("testFile.txt").show(false)

---
Caused by: org.apache.spark.sql.catalyst.json.SparkSQLJsonProcessingException: Malformed line in FAILFAST mode: {bad-record
---

Though the parsing mode is the same in both cases, why are the respective outputs so different?

That is expected behavior. from_json is a SQL function, and there is no concept of an exception (an intentional one) at this level. If the operation fails, the result is NULL.

While from_json provides an options argument, which allows you to set JSON reader options, this behavior, for the reason mentioned above, cannot be overridden.
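As a minimal Scala sketch of that limitation (assuming the Spark 2.1 behavior described in this answer; the Map("mode" -> "FAILFAST") entry is the same option the JSON data source accepts):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(List(
  StructField("a", IntegerType),
  StructField("b", IntegerType)
))

// Even with the reader option set, the malformed row still comes back
// as null on Spark 2.1 instead of raising an exception.
df.select(from_json(col("value"), schema, Map("mode" -> "FAILFAST"))).show(false)

+-------------------+
|jsontostruct(value)|
+-------------------+
|[1,2]              |
|null               |
+-------------------+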

On a side note, the default mode for DataFrameReader is PERMISSIVE.
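To see that default in action, a quick sketch (assuming schema inference over the same file; the _corrupt_record column name is the default value of spark.sql.columnNameOfCorruptRecord):

// Default PERMISSIVE mode: the malformed line is kept in the
// _corrupt_record column instead of failing the read.
spark.read.json("testFile.txt").show(false)

+---------------+----+----+
|_corrupt_record|a   |b   |
+---------------+----+----+
|null           |1   |2   |
|{bad-record    |null|null|
+---------------+----+----+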

Note that you are reading the file as a text file and then converting it to JSON. By default, newline is the delimiter for text files, and if a line contains a valid JSON string, it will be converted correctly with the schema that you define in the from_json() method.

If there are blank lines or invalid JSON text, you will get NULL.

Check this out:

val df = spark.read.text("in/testFile.txt")
println("Default show()")
df.show(false)

println("Using the from_json method ")
df.select(from_json(col("value"),
  StructType(List(
    StructField("a",IntegerType),
    StructField("b",IntegerType)
  ))
)).show(false)

When in/testFile.txt has the following content,

{"a": 1, "b": 2 }

it prints

Default show()
+-----------------+
|value            |
+-----------------+
|{"a": 1, "b": 2 }|
+-----------------+

Using the from_json method 
+--------------------+
|jsontostructs(value)|
+--------------------+
|[1,2]               |
+--------------------+

When your input has a blank line,

{"a": 1, "b": 2 }
// Blank line

the result is

Default show()
+-----------------+
|value            |
+-----------------+
|{"a": 1, "b": 2 }|
|                 |
+-----------------+

Using the from_json method 
+--------------------+
|jsontostructs(value)|
+--------------------+
|[1,2]               |
|null                |
+--------------------+
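If you need to separate the rows that failed to parse, one approach (a sketch; parsed, good, and bad are illustrative names, not part of the original snippet) is to keep the raw value column next to the parsed struct and filter on null:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(List(
  StructField("a", IntegerType),
  StructField("b", IntegerType)
))

// Keep the raw text next to the parsed struct so failed rows stay inspectable.
val parsed = df.select(col("value"), from_json(col("value"), schema).as("parsed"))

val good = parsed.filter(col("parsed").isNotNull)
val bad  = parsed.filter(col("parsed").isNull) // blank lines and invalid JSON land here
bad.show(false)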

To add to @user11022201's answer: it looks like the options argument can achieve the desired FAILFAST behavior. The code below is in pyspark and was tested with Spark 3.2.2.

import pyspark
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType

spark_session = pyspark.sql.SparkSession.builder.master("local[*]").appName("test").getOrCreate()

data = [
    {'value': '{"a": 1, "b": 2}'},
    {'value': '{bad-record'},
]

df = spark_session.createDataFrame(data)

schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType())
])

# If options are empty then the error does not happen and null values are added to the dataframe
# options = {}
options = {"mode": "FAILFAST"}

parsed_json_df = df.select(F.from_json(F.col("value"), schema, options))
parsed_json_df.show()

The result of the code above is an exception, which is the desired behavior:

org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.malformedRecordsDetectedInRecordParsingError(QueryExecutionErrors.scala:1236)
    at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:68)
