
I need to create a Spark DataFrame from a nested JSON file in Scala

I have a JSON file that looks like this:

{
  "tags": [
    {
      "1": "NpProgressBarTag",
      "2": "userPath",
      "3": "screen",
      "4": 6,
      "12": 9,
      "13": "buttonName",
      "16": 0,
      "17": 10,
      "18": 5,
      "19": 6,
      "20": 1,
      "35": 1,
      "36": 1,
      "37": 4,
      "38": 0,
      "39": "npChannelGuid",
      "40": "npShowGuid",
      "41": "npCategoryGuid",
      "42": "npEpisodeGuid",
      "43": "npAodEpisodeGuid",
      "44": "npVodEpisodeGuid",
      "45": "npLiveEventGuid",
      "46": "npTeamGuid",
      "47": "npLeagueGuid",
      "48": "npStatus",
      "50": 0,
      "52": "gupId",
      "54": "deviceID",
      "55": 1,
      "56": 0,
      "57": "uiVersion",
      "58": 1,
      "59": "deviceOS",
      "60": 1,
      "61": 0,
      "62": "channelLineupID",
      "63": 2,
      "64": "userProfile",
      "65": "sessionId",
      "66": "hitId",
      "67": "actionTime",
      "68": "seekTo",
      "69": "seekFrom",
      "70": "currentPosition"
    }
  ]
}

I tried to create a DataFrame using:

val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()

When I run this I get:

df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

How do we create a DataFrame based on the contents of the "tags" key? All I need is to pull the data out of "tags" and apply a case class like this:

case class ProgLang(id: String, `type`: String) // `type` is a reserved word in Scala, so it needs backticks

I need to convert this JSON data into a DataFrame with two columns, .toDF("id", "type"). Can anyone shed some light on this error?

You may modify the JSON using Circe.

Given that your values are sometimes strings and other times numbers, this was quite complex.

import io.circe._, io.circe.parser._, io.circe.generic.semiauto._

val json = """ ... """ // your JSON here.
val doc = parse(json).right.get
val mappedDoc = doc.hcursor.downField("tags").withFocus { array =>
  array.mapArray { jsons =>
    jsons.map { json =>
      json.mapObject { o =>
        o.mapValues { v =>
          // Cast numbers to strings.
          if (v.isString) v else Json.fromString(v.asNumber.get.toString)
        }
      }
    }
  }
}

final case class ProgLang(id: String, `type`: String )
final case class Tags(tags: List[Map[String, String]])
implicit val TagsDecoder: Decoder[Tags] = deriveDecoder

val tags = mappedDoc.top.get.as[Tags]
val data = for {
  tag <- tags.right.get.tags
  (id, _type) <- tag
} yield ProgLang(id, _type)

Now that you have a List of ProgLang, you can create a DataFrame directly from it, save it as a file with one JSON object per line, save it as a CSV file, etc.
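For example, a minimal sketch of that last step, assuming a SparkSession named spark is already in scope:

import spark.implicits._

// List[ProgLang] -> DataFrame; the columns "id" and "type" come from the case class fields
val df = data.toDF()
df.show()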
If the file is very big, you may use fs2 to stream it while transforming; it integrates nicely with Circe.


DISCLAIMER: I am far from being a "pro" with Circe, and this seems over-complicated for what looks like a simple task; there is probably a better / cleaner way of doing it (maybe using optics?), but hey, it works! Anyway, if anyone knows a better way to solve this, feel free to edit the question or provide your own answer.

By default, spark.read.json expects one JSON document per line (JSON Lines), which is why a pretty-printed file comes back as a single _corrupt_record. Reading it with the multiLine option fixes this:

val path = "some/path/to/jsonFile.json"
val df = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json(path)
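As a follow-up sketch (again assuming the SparkSession is named spark), you can then explode the tags array to inspect the struct Spark infers:

import org.apache.spark.sql.functions.explode

// each element of the "tags" array becomes its own row
df.select(explode(df("tags")).as("tag")).printSchema()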

Try the following code if your JSON file is not very big; wholeTextFiles reads each file as a single record, so the multi-line JSON parses correctly:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("some/path/to/jsonFile.json").values)


 