I need to create a Spark DataFrame from a nested JSON file in Scala
I have a JSON file that looks like this:
{
  "tags": [
    {
      "1": "NpProgressBarTag",
      "2": "userPath",
      "3": "screen",
      "4": 6,
      "12": 9,
      "13": "buttonName",
      "16": 0,
      "17": 10,
      "18": 5,
      "19": 6,
      "20": 1,
      "35": 1,
      "36": 1,
      "37": 4,
      "38": 0,
      "39": "npChannelGuid",
      "40": "npShowGuid",
      "41": "npCategoryGuid",
      "42": "npEpisodeGuid",
      "43": "npAodEpisodeGuid",
      "44": "npVodEpisodeGuid",
      "45": "npLiveEventGuid",
      "46": "npTeamGuid",
      "47": "npLeagueGuid",
      "48": "npStatus",
      "50": 0,
      "52": "gupId",
      "54": "deviceID",
      "55": 1,
      "56": 0,
      "57": "uiVersion",
      "58": 1,
      "59": "deviceOS",
      "60": 1,
      "61": 0,
      "62": "channelLineupID",
      "63": 2,
      "64": "userProfile",
      "65": "sessionId",
      "66": "hitId",
      "67": "actionTime",
      "68": "seekTo",
      "69": "seekFrom",
      "70": "currentPosition"
    }
  ]
}
I tried to create a DataFrame using:
val path = "some/path/to/jsonFile.json"
val df = sqlContext.read.json(path)
df.show()
When I run this, I get:
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
How do we create a DataFrame based on the contents of the "tags" key? All I need is to pull the data out of "tags" and apply a case class like this:
case class ProgLang(id: String, `type`: String)
I need to convert this JSON data into a DataFrame with two columns, .toDF("id", "type"). Can anyone shed some light on this error?
You may modify the JSON using Circe. Given that your values are sometimes Strings and other times Numbers, this was quite complex.
import io.circe._, io.circe.parser._, io.circe.generic.semiauto._

val json = """ ... """ // your JSON here.
val doc = parse(json).right.get
val mappedDoc = doc.hcursor.downField("tags").withFocus { array =>
  array.mapArray { jsons =>
    jsons.map { json =>
      json.mapObject { o =>
        o.mapValues { v =>
          // Cast numbers to strings.
          if (v.isString) v else Json.fromString(v.asNumber.get.toString)
        }
      }
    }
  }
}
final case class ProgLang(id: String, `type`: String )
final case class Tags(tags: List[Map[String, String]])
implicit val TagsDecoder: Decoder[Tags] = deriveDecoder
val tags = mappedDoc.top.get.as[Tags].right.get

val data = for {
  tag <- tags.tags
  (id, _type) <- tag
} yield ProgLang(id, _type)
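To see the flattening step in isolation (hypothetical sample values, no Circe involved), the same for-comprehension over a plain list of maps works like this:

```scala
// Case class with `type` escaped in backticks, since `type` is a Scala keyword.
case class ProgLang(id: String, `type`: String)

// Hypothetical sample: one "tags" entry after the numbers were cast to strings.
val tags: List[Map[String, String]] =
  List(Map("1" -> "NpProgressBarTag", "2" -> "userPath"))

// Flatten every (key, value) pair of every tag map into a ProgLang.
val data: List[ProgLang] = for {
  tag <- tags
  (id, tpe) <- tag
} yield ProgLang(id, tpe)
```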
Now you have a List of ProgLang, so you may create a DataFrame directly from it, save it as a file with one JSON object per line, save it as a CSV file, etc.
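For instance, the one-JSON-object-per-line output can be sketched without Spark at all. This is a naive rendering (assumes values contain no quotes that would need escaping; the field names are taken from ProgLang):

```scala
case class ProgLang(id: String, `type`: String)

// Render each ProgLang as a single-line JSON object (naive, no escaping).
def toJsonLine(p: ProgLang): String =
  s"""{"id":"${p.id}","type":"${p.`type`}"}"""

val lines = List(ProgLang("1", "NpProgressBarTag"), ProgLang("2", "userPath"))
  .map(toJsonLine)
  .mkString("\n")
```

Each element of `lines` is then a valid row for `spark.read.json` or for writing out as a JSON-lines file.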
If the file is very big, you may use fs2 to stream it while transforming; it integrates nicely with Circe.
DISCLAIMER: I am far from being a "pro" with Circe; this seems over-complicated for what looks like a simple task, and there is probably a better/cleaner way of doing it (maybe using optics?), but hey, it works! Anyway, if anyone knows a better way to solve this, feel free to edit the question or provide your own answer.
Since your JSON spans multiple lines, enable the multiLine option when reading:
val path = "some/path/to/jsonFile.json"
spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .json(path)
Try the following code if your JSON file is not very big:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("some/path/to/jsonFile.json").values)