
Spark from_json No Exception

I am working with Spark 2.1 (Scala 2.11).

I want to load JSON-formatted strings with a defined schema from one dataframe into another dataframe. I have tried a few solutions, but the least expensive turns out to be the standard column function from_json. I tried an example ( https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-collection.html#from_json ) with this function, and it is giving me unexpected results.

val df = spark.read.text("testFile.txt")

df.show(false)

+----------------+
|value           |
+----------------+
|{"a": 1, "b": 2}|
|{bad-record     |
+----------------+


df.select(from_json(col("value"),
      StructType(List(
                  StructField("a",IntegerType),
                  StructField("b",IntegerType)
                ))
    )).show(false)


+-------------------+
|jsontostruct(value)|
+-------------------+
|[1,2]              |
|null               |
+-------------------+

This behavior is similar to mode: PERMISSIVE, which is not the default. By default, it is set to FAILFAST mode, meaning it should throw an exception whenever the input data and the enforced schema do not match.

I tried loading testFile.txt with DataFrameReader (JSON data source and FAILFAST mode) and successfully caught an exception.

spark.read.option("mode","FAILFAST").json("testFile.txt").show(false)

---
Caused by: org.apache.spark.sql.catalyst.json.SparkSQLJsonProcessingException: Malformed line in FAILFAST mode: {bad-record
---

Though the parsing mode is the same in both cases, why are the respective outputs so different?

That is expected behavior. from_json is a SQL function, and there is no concept of an exception (an intentional one) at this level. If the operation fails, the result is NULL.

While from_json provides an options argument, which allows you to set JSON reader options, this behavior, for the reason mentioned above, cannot be overridden.
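As a minimal Scala sketch of that limitation (assuming the Spark 2.1 behavior described in this answer; the Map("mode" -> "FAILFAST") entry is the same option the JSON data source accepts):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(List(
  StructField("a", IntegerType),
  StructField("b", IntegerType)
))

// Even with the reader option set, the malformed row still comes back
// as null on Spark 2.1 instead of raising an exception.
df.select(from_json(col("value"), schema, Map("mode" -> "FAILFAST"))).show(false)

+-------------------+
|jsontostruct(value)|
+-------------------+
|[1,2]              |
|null               |
+-------------------+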

On a side note, the default mode for DataFrameReader is PERMISSIVE.
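To see that default in action, a quick sketch (assuming schema inference over the same file; the _corrupt_record column name is the default value of spark.sql.columnNameOfCorruptRecord):

// Default PERMISSIVE mode: the malformed line is kept in the
// _corrupt_record column instead of failing the read.
spark.read.json("testFile.txt").show(false)

+---------------+----+----+
|_corrupt_record|a   |b   |
+---------------+----+----+
|null           |1   |2   |
|{bad-record    |null|null|
+---------------+----+----+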

Note that you are reading the file as a text file and then converting it to JSON. By default, newline is the delimiter for text files, and if a line contains a valid JSON string, it will be converted correctly with the schema that you define in the from_json() method.

If there are blank lines or invalid JSON text, you will get NULL.

Check this out:

val df = spark.read.text("in/testFile.txt")
println("Default show()")
df.show(false)

println("Using the from_json method ")
df.select(from_json(col("value"),
  StructType(List(
    StructField("a",IntegerType),
    StructField("b",IntegerType)
  ))
)).show(false)

When in/testFile.txt has the following content,

{"a": 1, "b": 2 }

it prints

Default show()
+-----------------+
|value            |
+-----------------+
|{"a": 1, "b": 2 }|
+-----------------+

Using the from_json method 
+--------------------+
|jsontostructs(value)|
+--------------------+
|[1,2]               |
+--------------------+

When your input has a blank line,

{"a": 1, "b": 2 }
// Blank line

the result is

Default show()
+-----------------+
|value            |
+-----------------+
|{"a": 1, "b": 2 }|
|                 |
+-----------------+

Using the from_json method 
+--------------------+
|jsontostructs(value)|
+--------------------+
|[1,2]               |
|null                |
+--------------------+
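If you need to separate the rows that failed to parse, one approach (a sketch; parsed, good, and bad are illustrative names, not part of the original snippet) is to keep the raw value column next to the parsed struct and filter on null:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(List(
  StructField("a", IntegerType),
  StructField("b", IntegerType)
))

// Keep the raw text next to the parsed struct so failed rows stay inspectable.
val parsed = df.select(col("value"), from_json(col("value"), schema).as("parsed"))

val good = parsed.filter(col("parsed").isNotNull)
val bad  = parsed.filter(col("parsed").isNull) // blank lines and invalid JSON land here
bad.show(false)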

To add to @user11022201's answer: it looks like the options argument can achieve the desired FAILFAST behavior. The code below is in pyspark and was tested with Spark 3.2.2.

import pyspark
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType

spark_session = pyspark.sql.SparkSession.builder.master("local[*]").appName("test").getOrCreate()

data = [
    {'value': '{"a": 1, "b": 2}'},
    {'value': '{bad-record'},
]

df = spark_session.createDataFrame(data)

schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType())
])

# If options are empty then the error does not happen and null values are added to the dataframe
# options = {}
options = {"mode": "FAILFAST"}

parsed_json_df = df.select(F.from_json(F.col("value"), schema, options))
parsed_json_df.show()

The result of the code above is an exception, which is the desired behavior:

org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.malformedRecordsDetectedInRecordParsingError(QueryExecutionErrors.scala:1236)
    at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:68)
