How to translate a complex nested JSON structure into multiple columns in a Spark DataFrame
I am learning Scala and am trying to select a few columns from a large nested JSON file to build a DataFrame. This is the gist of the JSON:
{
  "meta": {"a": 1, "b": 2}, // I want to ignore meta
  "objects": [
    {
      "caucus": "Progressive",
      "person": {
        "name": "Mary",
        "party": "Green Party",
        "age": 50,
        "gender": "female" // etc.
      }
    } // etc.
  ]
}
As is, the data looks like this when read in with Spark:
val df = spark.read.json("file")
df.show()
+--------------------+--------------------+
| meta| objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+
Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.
I've experimented with explode() and have reproduced the schema as a StructType(), but I am not sure how to deal with a nested structure like this in general.
You can use ".*" on a column of type struct to transform it into multiple columns, one per field:
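import org.apache.spark.sql.functions.{col, explode}

val df = spark.read.json("file.json")
// Explode the array so each element of "objects" becomes its own row,
// then expand the structs into top-level columns with ".*".
df.select(col("meta"), explode(col("objects")).as("objects"))
  .select("meta.*", "objects.*")
  .select("a", "b", "caucus", "person.*")
  .show(false)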
+---+---+-----------+---+------+----+-----------+
|a |b |caucus |age|gender|name|party |
+---+---+-----------+---+------+----+-----------+
|1 |2 |Progressive|50 |female|Mary|Green Party|
+---+---+-----------+---+------+----+-----------+
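To narrow the result down to just the three columns asked for in the question, the same ".*" expansion can be followed by a final select. The following is a minimal sketch rather than a definitive implementation: the object name `StarExpansionSketch`, the helper `namePartyCaucus`, the inline `sampleJson`, and the local SparkSession setup are all assumptions made for illustration and do not come from the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object StarExpansionSketch {
  // Assumed sample: one JSON document on a single line, shaped like the question's data.
  val sampleJson: String =
    """{"meta":{"a":1,"b":2},"objects":[{"caucus":"Progressive","person":{"name":"Mary","party":"Green Party","age":50,"gender":"female"}}]}"""

  // Hypothetical helper: drops "meta" and keeps only Name, Party and Caucus.
  def namePartyCaucus(df: DataFrame): DataFrame =
    df.select(explode(col("objects")).as("obj")) // one row per array element
      .select("obj.*")                           // struct fields become columns: caucus, person
      .select("caucus", "person.*")              // expand the nested person struct
      .select(
        col("name").as("Name"),
        col("party").as("Party"),
        col("caucus").as("Caucus"))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("star-expansion-sketch")
      .getOrCreate()
    import spark.implicits._

    // spark.read.json also accepts a Dataset[String] of JSON documents.
    val df = spark.read.json(Seq(sampleJson).toDS())
    namePartyCaucus(df).show(false)

    spark.stop()
  }
}
```

Note that the sample here is a single-line document. If the real file is pretty-printed across several lines, reading it will likely need `spark.read.option("multiLine", true).json(...)`.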
There's no generic way to handle it because, of course, it depends on the shape of your data. In your case, you want to explode an array, which will create a column called col that contains structs. You can then access the fields within the struct using dot notation, so to extract the fields you asked for you can do this:
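import org.apache.spark.sql.functions.explode_outer
import spark.implicits._ // enables the $"..." column syntax
// explode_outer keeps rows whose array is null or empty; explode would drop them.
df.select(explode_outer($"objects"))
  .select(
    $"col.caucus",
    $"col.person.name",
    $"col.person.party")
  .show()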
+-----------+----+-----------+
| caucus|name| party|
+-----------+----+-----------+
|Progressive|Mary|Green Party|
+-----------+----+-----------+
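Where the arrays sit still has to be decided case by case, but once they are exploded, the remaining struct nesting can be walked mechanically through the DataFrame's schema. The sketch below is one possible approach, not part of the original answer: `FlattenSketch`, `leafColumns`, and `sampleJson` are hypothetical names, the underscore-joined aliases are an arbitrary choice, and the code assumes that field names contain no dots and that nested arrays should be left untouched.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.StructType

object FlattenSketch {
  // Assumed sample: one JSON document on a single line, shaped like the question's data.
  val sampleJson: String =
    """{"meta":{"a":1,"b":2},"objects":[{"caucus":"Progressive","person":{"name":"Mary","party":"Green Party","age":50,"gender":"female"}}]}"""

  // Hypothetical helper: returns one Column per leaf field of the schema,
  // aliasing a path such as "obj.person.name" to "obj_person_name".
  // Only structs are descended into; array columns are returned as they are.
  def leafColumns(schema: StructType, prefix: Option[String] = None): Seq[Column] =
    schema.fields.toSeq.flatMap { field =>
      val path = prefix.fold(field.name)(p => s"$p.${field.name}")
      field.dataType match {
        case nested: StructType => leafColumns(nested, Some(path))
        case _                  => Seq(col(path).as(path.replace(".", "_")))
      }
    }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    val exploded = spark.read.json(Seq(sampleJson).toDS())
      .select(explode_outer($"objects").as("obj")) // explode the array first

    // Then expand every nested struct field in a single select.
    val flat = exploded.select(leafColumns(exploded.schema): _*)
    flat.printSchema()

    spark.stop()
  }
}
```

For a handful of known fields, the explicit dot notation shown in the answer is simpler. A helper like this is mainly useful when the schema is wide or not known in advance.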