How to translate a complex nested JSON structure into a multi-column DataFrame in Spark

Question

I am learning Scala and am trying to select a few columns from a large nested JSON file and turn them into a DataFrame. This is the gist of the JSON:

{
  "meta":
    {"a": 1, "b": 2},    // I want to ignore meta
  "objects":
  [
    {
      "caucus": "Progressive",
      "person":
        {
          "name": "Mary",
          "party": "Green Party",
          "age": 50,
          "gender": "female" // etc.
        }
    }, // etc.
  ]
}

As is, the data looks like this when read in with Spark:

    val df = spark.read.json("file")
    df.show()
+--------------------+--------------------+
|                meta|             objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+

Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.

I've experimented with explode() and have reproduced the schema as a StructType(), but I am not sure how to deal with a nested structure like this in general.

Answer 1

You can use ".*" on a column of type struct to expand it into one column per field:

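import org.apache.spark.sql.functions.{col, explode}

// For a pretty-printed (multi-line) file, add .option("multiLine", true) before .json(...)
val df = spark.read.json("file.json")
df.select(col("meta"), explode(col("objects")).as("objects"))  // one row per array element
  .select("meta.*", "objects.*")                                // columns: a, b, caucus, person
  .select("a", "b", "caucus", "person.*")                       // expand the person struct
  .show(false)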


+---+---+-----------+---+------+----+-----------+
|a  |b  |caucus     |age|gender|name|party      |
+---+---+-----------+---+------+----+-----------+
|1  |2  |Progressive|50 |female|Mary|Green Party|
+---+---+-----------+---+------+----+-----------+
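
If you only need the three columns from the question, the same ".*" expansion can be followed by one final select. What follows is a minimal sketch rather than part of the original answer: the object name StarExpansionSketch, the helper requestedColumns, and the inline sample record (standing in for the real file) are all assumptions made for illustration, and it assumes Spark 2.2 or later on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object, not from the original answer.
object StarExpansionSketch {

  // Expands the nested structs with ".*" and keeps only name, party and caucus.
  def requestedColumns(df: DataFrame): DataFrame =
    df.select(explode(col("objects")).as("obj")) // one row per array element
      .select("obj.*")                            // columns: caucus, person
      .select("person.*", "caucus")               // expand the person struct
      .select("name", "party", "caucus")          // drop age, gender, etc.

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("star-expansion-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample record; one JSON document per string.
    val sample = Seq(
      """{"meta": {"a": 1, "b": 2}, "objects": [{"caucus": "Progressive", "person": {"name": "Mary", "party": "Green Party", "age": 50, "gender": "female"}}]}"""
    ).toDS()

    requestedColumns(spark.read.json(sample)).show(false)
    spark.stop()
  }
}
```

Dropping meta in the first select means it is never expanded, which matches the question's wish to ignore it.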

Answer 2

There's no generic way to handle this, because it depends on the shape of your data. In your case, you want to explode an array; unless you give it an alias, the exploded column is named col and contains structs. You can then access the fields within each struct using dot notation, so to extract the fields you asked for you can do this:

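import org.apache.spark.sql.functions.explode_outer
import spark.implicits._  // enables the $"..." syntax (already in scope in spark-shell)

df.select(explode_outer($"objects")).  // unaliased, so the new column is named "col"
  select(
     $"col.caucus",
     $"col.person.name",
     $"col.person.party").show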

+-----------+----+-----------+
|     caucus|name|      party|
+-----------+----+-----------+
|Progressive|Mary|Green Party|
+-----------+----+-----------+
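
The answer uses explode_outer rather than explode without saying why. The difference only shows up when the array is null or empty: explode drops that row, while explode_outer keeps it and fills the extracted fields with nulls. The sketch below illustrates this under stated assumptions: the object name ExplodeOuterSketch, both helper names, the alias obj, and the two inline sample records (the second has no "objects" field) are hypothetical and not from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, explode_outer}

// Hypothetical helper object, not from the original answer.
object ExplodeOuterSketch {

  // explode: rows whose "objects" array is null or empty produce no output rows.
  def withExplode(df: DataFrame): DataFrame =
    df.select(explode(col("objects")).as("obj")) // explicit alias instead of the default "col"
      .select(col("obj.caucus"), col("obj.person.name"), col("obj.person.party"))

  // explode_outer: such rows are kept, with null in every extracted field.
  def withExplodeOuter(df: DataFrame): DataFrame =
    df.select(explode_outer(col("objects")).as("obj"))
      .select(col("obj.caucus"), col("obj.person.name"), col("obj.person.party"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-outer-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: the second record has no "objects" field at all.
    val sample = Seq(
      """{"meta": {"a": 1, "b": 2}, "objects": [{"caucus": "Progressive", "person": {"name": "Mary", "party": "Green Party"}}]}""",
      """{"meta": {"a": 3, "b": 4}}"""
    ).toDS()
    val df = spark.read.json(sample)

    withExplode(df).show(false)      // only the Mary row
    withExplodeOuter(df).show(false) // the Mary row plus one all-null row
    spark.stop()
  }
}
```

If rows without any objects should simply disappear from the result, plain explode is enough; explode_outer is the safer choice when every input record must stay visible.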

Answer 1 (score 1, accepted): 2023-01-31 00:03:15

Answer 2 (score 0): 2023-01-30 22:05:39
