
How to get the structure of JSON in Scala

I have a lot of unstructured JSON files, and I want to find the deepest elements along with every element on the path that leads to them.

For example:

{
"menu": {
    "id": "file",
    "popup": {
        "menuitem": {
            "module": {
                "-vdsr": "New",
                "-sdst": "Open",
                "-mpoi": "Close" }
        ...
    }
}

In this case the result would be:

menu.popup.menuitem.module.-vdsr
menu.popup.menuitem.module.-sdst
menu.popup.menuitem.module.-mpoi

I tried Jackson and Json4s, and they work well for getting to the last value, but I don't see how I can get the whole structure.

I want to run this as an Apache Spark job on very large JSON files, and the structure of each file will be very complex. I also tried Spark SQL, but if I don't know the entire structure I can't get it.

What you're asking for is essentially a tree traversal of an object, where JSON objects are treated as nodes with named branches and all other JSON types are treated as leaves. There are many ways to do this. You might consider writing a recursive function that explores the entire tree. Here is an example that works with Play JSON, but it shouldn't be very different in other libraries:

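import play.api.libs.json._

// Returns the dotted key path to every field whose value is not a
// non-empty object (i.e. every leaf of the tree).
def unfold(json: JsValue): Seq[String] = json match {
    case obj: JsObject => obj.fields.toSeq.flatMap {
        case (key, value) =>
            val subPaths = unfold(value)
            // No sub-paths means the value is a leaf, so the path ends at its key.
            if (subPaths.isEmpty) Seq(key) else subPaths.map(path => s"$key.$path")
    }
    case _ => Seq.empty
}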
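
To illustrate how the same recursion carries over to other libraries, here is a minimal, library-independent sketch. The `Json`, `JObj`, `JArr` and `JLeaf` types and the `leafPaths` name are hypothetical stand-ins invented for this example, not part of Jackson, Json4s or Play JSON. Treating array elements as sharing the array's own path is also an assumption, since the question does not say how arrays should be reported.

```scala
object JsonPaths {
  // Hypothetical stand-in JSON model, for illustration only.
  sealed trait Json
  final case class JObj(fields: Seq[(String, Json)]) extends Json
  final case class JArr(items: Seq[Json]) extends Json
  final case class JLeaf(value: String) extends Json

  // Depth-first traversal returning the dotted key path to every leaf.
  // Array elements reuse the array's path, so duplicates are removed.
  def leafPaths(json: Json, prefix: String = ""): Seq[String] = json match {
    case JObj(fields) =>
      fields.flatMap { case (key, value) =>
        val path = if (prefix.isEmpty) key else s"$prefix.$key"
        leafPaths(value, path)
      }
    case JArr(items) => items.flatMap(leafPaths(_, prefix)).distinct
    case JLeaf(_)    => Seq(prefix)
  }

  // The document from the question, with the elided parts left out.
  val example: Json =
    JObj(Seq(
      "menu" -> JObj(Seq(
        "id" -> JLeaf("file"),
        "popup" -> JObj(Seq(
          "menuitem" -> JObj(Seq(
            "module" -> JObj(Seq(
              "-vdsr" -> JLeaf("New"),
              "-sdst" -> JLeaf("Open"),
              "-mpoi" -> JLeaf("Close")
            ))
          ))
        ))
      ))
    ))
}
```

Calling `JsonPaths.leafPaths(JsonPaths.example)` returns one path per leaf, including `menu.id`; filter the result if only the deepest paths are wanted.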

The beauty of the new DataFrame API is that it infers the schema automatically as you load the data.
It does so by making a single pass over the whole data set.

I suggest that you load your JSON, then experiment with transformations and see what you can come up with. You can drop or select a subset of columns, filter rows, aggregate, map over them, pretty much anything.

E.g.:

// you can use globbing to load multiple files
val jsonTbl = sqlContext.load("path to json file", "json")

// print the inferred schema
jsonTbl.printSchema

// now you can use the DataFrame API to transform the data set, e.g.
val outputTbl = jsonTbl
    .filter("menu.popup.menuitem.module.-vdsr = 'Some value'")
    .groupBy("menu.popup.menuitem.module.-vdsr").count
    .select("menu.popup.menuitem.module.-sdst", "other fields/columns")

outputTbl.show
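
Since the inferred schema is itself a tree, it can be walked to produce the dotted paths the question asks for. The following is a rough sketch rather than a definitive implementation: `SchemaPaths` and `leafPaths` are hypothetical names introduced here, and looking through arrays to their element type is an assumption about how arrays should be reported.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

object SchemaPaths {
  // Lists the dotted path to every leaf (non-struct) field of a schema.
  def leafPaths(dataType: DataType, prefix: String = ""): Seq[String] = dataType match {
    case struct: StructType =>
      struct.fields.toSeq.flatMap { field =>
        val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
        leafPaths(field.dataType, path)
      }
    // Assumption: an array is reported through the paths of its element type.
    case ArrayType(elementType, _) => leafPaths(elementType, prefix)
    case _ => Seq(prefix)
  }
}
```

With the `jsonTbl` DataFrame from the example above, `SchemaPaths.leafPaths(jsonTbl.schema).foreach(println)` would print one path per leaf field.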
