Spark - Read JSON array from column

Question

Using Spark 2.11, I have the following Dataset (read from a Cassandra table):

+-----------+------------------------------------------------------+
|id         |attributes                                            |
+-----------+------------------------------------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|
+-----------+------------------------------------------------------+

This is the output of printSchema():

root
 |-- id: string (nullable = true)
 |-- attributes: string (nullable = true)

The attributes column is a string holding a JSON array of objects. I'm trying to explode it into a Dataset but keep failing. I tried to define the schema as follows:

StructType type = new StructType()
                .add("id", new IntegerType(), false)
                .add("name", new StringType(), false)
                .add("score", new FloatType(), false)
                .add("snippets", new IntegerType(), false );
        
ArrayType schema = new ArrayType(type, false);

And passed it to from_json as follows:

df = df.withColumn("val", functions.from_json(df.col("attributes"), schema));

This fails with a MatchError:

Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.IntegerType@43756cb (of class org.apache.spark.sql.types.IntegerType)

What's the correct way to do that?

Answer 1

You can specify the schema this way:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema = ArrayType(
  StructType(Array(
    StructField("id", IntegerType, false),
    StructField("name", StringType, false),
    StructField("score", FloatType, false),
    StructField("snippets", IntegerType, false)
  )),
  false
)

val df1 = df.withColumn("val", from_json(col("attributes"), schema))

df1.show(false)

//+-----------+------------------------------------------------------+------------------------+
//|id         |attributes                                            |val                     |
//+-----------+------------------------------------------------------+------------------------+
//|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
//+-----------+------------------------------------------------------+------------------------+

Or for Java:

import java.util.Arrays;
import org.apache.spark.sql.types.ArrayType;
import static org.apache.spark.sql.types.DataTypes.*;

// createArrayType returns an ArrayType, so the variable is typed accordingly
ArrayType schema = createArrayType(createStructType(Arrays.asList(
    createStructField("id", IntegerType, false),
    createStructField("name", StringType, false),
    createStructField("score", FloatType, false),
    createStructField("snippets", IntegerType, false)
)), false);

Answer 2

You can define the schema as a literal string instead:

val df2 = df.withColumn(
    "val",
    from_json(
        df.col("attributes"),
        lit("array<struct<id: int, name: string, score: float, snippets: int>>")
    )
)

df2.show(false)
+-----------+------------------------------------------------------+------------------------+
|id         |attributes                                            |val                     |
+-----------+------------------------------------------------------+------------------------+
|YH8B135U123|[{"id":1,"name":"function","score":10.0,"snippets":1}]|[[1, function, 10.0, 1]]|
+-----------+------------------------------------------------------+------------------------+

If you prefer to build the schema programmatically:

val spark_struct = new StructType()
                .add("id", IntegerType, false)
                .add("name", StringType, false)
                .add("score", FloatType, false)
                .add("snippets", IntegerType, false)

val schema = new ArrayType(spark_struct, false)

val df2 = df.withColumn(
    "val",
    from_json(
        df.col("attributes"),
        schema
    )
)

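Two problems with your original code were: (1) you used type as a variable name, which is a reserved keyword in Scala (though not in Java), and (2) you don't need new inside add. Pass the existing type objects (IntegerType, StringType and so on, or DataTypes.IntegerType etc. from Java) rather than constructing new instances. Point (2) is what triggers the MatchError: Spark matches against those shared singleton objects, and a freshly constructed instance matches none of them.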
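To see why constructing the types with new can matter, here is a toy model in plain Java that needs no Spark on the classpath. It is only a sketch: IntType and describe are hypothetical stand-ins I made up for illustration, not Spark's real classes or internals. The assumption is that the library recognises a type by comparing it with one shared instance, so a second instance of the same class is not recognised.

```java
// Toy model of the singleton-matching pitfall. IntType and describe are
// hypothetical stand-ins for illustration, not Spark's real types.
public class SingletonMatchDemo {

    // Stand-in for a data type that is meant to exist as one shared instance.
    public static final class IntType {
        public static final IntType INSTANCE = new IntType();

        // Left accessible on purpose, mirroring how the question's Java code
        // was able to call `new IntegerType()`.
        public IntType() {}
    }

    // Stand-in for library code that recognises a type by comparing it
    // with the shared instance.
    public static String describe(Object dataType) {
        if (dataType == IntType.INSTANCE) {
            return "int";
        }
        // The real failure in the question is a scala.MatchError; the toy
        // model uses a plain Java exception instead.
        throw new IllegalArgumentException("no match for " + dataType);
    }

    public static void main(String[] args) {
        // The shared instance is recognised.
        System.out.println(describe(IntType.INSTANCE));

        // A freshly constructed instance of the same class is not.
        try {
            describe(new IntType());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected fresh instance");
        }
    }
}
```

In the real code the equivalent fix is to pass DataTypes.IntegerType, DataTypes.StringType and DataTypes.FloatType to add instead of new IntegerType() and friends.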
