How to read nested JSON for aggregation?

Question

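I'm new to Spark. What I want to do is read nested JSON and group the records by a specific condition. For example, if the JSON contains a person's details, such as their city and zip code, I'd like to group the people who belong to the same city and zip code.

I've gotten as far as reading the JSON files into a Dataset, but I don't know how to group them.

My nested JSON format is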

{
  "entity": {
    "name": "SJ",
    "id": 31
  },
  "hierarchy": {
    "state": "TN",
    "city": "CBE"
  },
  "data": {}
}

This is the code I've written to read the nested JSON from a file.

public void groupJsonString(SparkSession spark) {
    Dataset<Row> studentRecordDS = spark.read()
            .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
            .json("/home/shiney/Documents/NGA/sparkJsonFiles/*.json");
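    StructType st = studentRecordDS.schema();

    // Collect the struct type of each top-level field (entity, hierarchy, data).
    // The cast assumes every top-level field is a struct, as in the sample above.
    List<StructType> nestedList = new ArrayList<>();
    for (StructField field : st.fields()) {
        nestedList.add((StructType) field.dataType());
    }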

}

Answer 1

TL;DR Use spark.read.json (as you did), followed by a "flatten" operator in select.

(I'm using Scala and leaving the conversion to Java as a home exercise :))

Let's use your sample.

$ cat ../datasets/sample.json
{
  "entity": {
    "name": "SJ",
    "id": 31
  },
  "hierarchy": {
    "state": "TN",
    "city": "CBE"
  },
  "data": {}
}

The code could be as follows (again, it's Scala).

val entities = spark
  .read
  .option("multiLine", true)
  .json("../datasets/sample.json")

scala> entities.printSchema
root
 |-- entity: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |-- hierarchy: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)

Let's flatten the entity and hierarchy top-level columns.

scala> entities.select("entity.*", "hierarchy.*").show
+---+----+----+-----+
| id|name|city|state|
+---+----+----+-----+
| 31|  SJ| CBE|   TN|
+---+----+----+-----+

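Aggregation should be a breeze now: with the nested fields flattened into ordinary columns, you can groupBy the columns you care about and apply an aggregate.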
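For what it's worth, here is a minimal sketch of what the grouping step could look like. It groups by state and city, since those are the fields present in the sample (the question mentions zip code, which the sample does not contain). The second and third records, the column aliases people and names, and the app name are made up for illustration; only the first record comes from the post. It also assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, count, lit}

// In spark-shell a session named `spark` already exists; this is for a standalone run.
val spark = SparkSession.builder()
  .appName("nested-json-grouping") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// One JSON document per string, so multiLine is not needed here.
// Only the first record is from the question; the others are invented
// so that the grouping has something to do. The empty "data" field is omitted.
val records = Seq(
  """{"entity":{"name":"SJ","id":31},"hierarchy":{"state":"TN","city":"CBE"}}""",
  """{"entity":{"name":"AB","id":32},"hierarchy":{"state":"TN","city":"CBE"}}""",
  """{"entity":{"name":"CD","id":33},"hierarchy":{"state":"KA","city":"BLR"}}"""
).toDS()

val entities = spark.read.json(records)

// Same flattening as above: struct fields become top-level columns.
val flat = entities.select("entity.*", "hierarchy.*")

// Group people that share the same state and city.
val grouped = flat
  .groupBy(col("state"), col("city"))
  .agg(
    count(lit(1)).as("people"),          // number of rows per group
    collect_list(col("name")).as("names") // names of the people in the group
  )

grouped.show(truncate = false)
```

If your real data does carry a zip code field, the same pattern should apply: flatten it into a column and add it to the groupBy list.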

Solution 1
2 Accepted 2017-12-07 13:46:06