How to translate a complex nested JSON structure into multiple columns in a Spark DataFrame

I am learning Scala and am trying to select a few columns from a large nested JSON file to build a DataFrame. This is the gist of the JSON:

{
  "meta":
    {"a": 1, "b": 2},    // I want to ignore meta
  "objects":
  [
    {
      "caucus": "Progressive",
      "person":
        {
          "name": "Mary",
          "party": "Green Party",
          "age": 50,
          "gender": "female" // etc.
        }
    }, // etc.
  ]
}

Read in with Spark as is, the data looks like this:

    val df = spark.read.json("file")
    df.show()
+--------------------+--------------------+
|                meta|             objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+

Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.

I've experimented with explode() and have reproduced the schema as a StructType(), but I am not sure how to deal with a nested structure like this in general.

You can use ".*" on a column of type struct to transform it into multiple columns, one per field:

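import org.apache.spark.sql.functions.{col, explode}

// For a pretty-printed (multi-line) file, use spark.read.option("multiLine", true).json(...)
val df = spark.read.json("file.json")
df.select(col("meta"), explode(col("objects")).as("objects"))  // one row per array element
  .select("meta.*", "objects.*")                               // flatten both structs
  .select("a", "b", "caucus", "person.*")                      // flatten the nested person struct
  .show(false)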


+---+---+-----------+---+------+----+-----------+
|a  |b  |caucus     |age|gender|name|party      |
+---+---+-----------+---+------+----+-----------+
|1  |2  |Progressive|50 |female|Mary|Green Party|
+---+---+-----------+---+------+----+-----------+
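
To get exactly the three columns asked for (Name | Party | Caucus) and skip meta entirely, the same explode step can be followed by an explicit select with aliases. The following is a minimal sketch, not a definitive implementation. It assumes a local SparkSession and an inline one-record sample standing in for the real file; the names `sample`, `raw`, `exploded`, `result`, and the alias `obj` are illustrative and do not come from the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session, created here only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[1]").appName("nested-json-sketch").getOrCreate()
import spark.implicits._

// Assumption: an inline stand-in for the file, shaped like the gist in the question.
val sample = Seq(
  """{"meta": {"a": 1, "b": 2}, "objects": [{"caucus": "Progressive", "person": {"name": "Mary", "party": "Green Party", "age": 50, "gender": "female"}}]}"""
).toDS()

val raw = spark.read.json(sample)

// One row per element of the objects array; meta is simply never selected.
val exploded = raw.select(explode(col("objects")).as("obj"))

// Pick the nested fields with dot notation and rename them to the requested headers.
val result = exploded.select(
  col("obj.person.name").as("Name"),
  col("obj.person.party").as("Party"),
  col("obj.caucus").as("Caucus"))

result.show(false)
```

Selecting the fields explicitly, rather than with ".*", keeps the output stable if extra fields such as age or gender appear in the data.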

There's no generic way to handle it, because it depends on the shape of your data. In your case, you want to explode an array, which creates a column called col that contains structs. You can then access the fields within each struct using dot notation, so to extract the fields you asked for you can do this:

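import org.apache.spark.sql.functions.explode_outer
import spark.implicits._  // enables the $"..." column syntax

// Without an alias, explode_outer names its output column "col"
df.select(explode_outer($"objects")).
  select(
     $"col.caucus",
     $"col.person.name",
     $"col.person.party").show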

+-----------+----+-----------+
|     caucus|name|      party|
+-----------+----+-----------+
|Progressive|Mary|Green Party|
+-----------+----+-----------+
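
Since the question mentions having the schema as a StructType, another option is to pass a reduced schema to the reader so that only the needed fields are parsed; JSON fields missing from the supplied schema are dropped. The sketch below is one possible way to do this under stated assumptions: a local SparkSession, an inline sample in place of the file, and illustrative names (`personType`, `objectType`, `reducedSchema`, `slim`, `flat`, `obj`) that are not part of the original answers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Assumption: a local session, created here only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[1]").appName("reduced-schema-sketch").getOrCreate()
import spark.implicits._

// Reduced schema: only objects[].caucus and objects[].person.{name, party}.
// meta, age and gender are deliberately left out, so they are not read.
val personType = StructType(Seq(
  StructField("name", StringType),
  StructField("party", StringType)))
val objectType = StructType(Seq(
  StructField("caucus", StringType),
  StructField("person", personType)))
val reducedSchema = StructType(Seq(
  StructField("objects", ArrayType(objectType))))

// Assumption: an inline stand-in for the file, shaped like the gist in the question.
val sample = Seq(
  """{"meta": {"a": 1, "b": 2}, "objects": [{"caucus": "Progressive", "person": {"name": "Mary", "party": "Green Party", "age": 50, "gender": "female"}}]}"""
).toDS()

val slim = spark.read.schema(reducedSchema).json(sample)

// Same explode-then-dot-notation approach as above, with an explicit alias.
val exploded = slim.select(explode_outer(col("objects")).as("obj"))
val flat = exploded.select("obj.caucus", "obj.person.name", "obj.person.party")

flat.show(false)
```

Supplying the schema also skips the schema-inference pass over the input, which can matter for a large file.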
