How to read a nested JSON in Spark Scala?
Here is my nested JSON file:
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": {"lat": 38.00, "long": 97.00}
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat": 47.41, "long": -122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat": 33.61, "long": -111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat": 35.93, "long": -85.46}
    }
  }
}
How can I read this JSON file into a DataFrame with Spark Scala? There is no array in the JSON, so I can't use explode directly. Can anyone help?
import org.apache.spark.sql.functions._

// "multiline" is required because the JSON object spans multiple lines
val df = spark.read.option("multiline", true).json("data/test.json")

df
  // array("source.*") gathers the four sensor structs into one array column,
  // so explode can turn them into one row per sensor
  .select(col("dc_id"), explode(array("source.*")) as "level1")
  .withColumn("id", col("level1.id"))
  .withColumn("ip", col("level1.ip"))
  .withColumn("temp", col("level1.temp"))
  .withColumn("description", col("level1.description"))
  .withColumn("c02_level", col("level1.c02_level"))
  .withColumn("lat", col("level1.geo.lat"))
  .withColumn("long", col("level1.geo.long"))
  .drop("level1")
  .show(false)
Sample output:
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc_id |id |ip |temp|description |c02_level|lat |long |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc-101|10 |68.28.91.22 |35 |Sensor attached to the container ceilings |1475 |38.0 |97.0 |
|dc-101|8 |208.109.163.218|40 |Sensor attached to the factory ceilings |1346 |33.61|-111.89|
|dc-101|13 |67.185.72.1 |34 |Sensor ipad attached to carbon cylinders |1370 |47.41|-122.0 |
|dc-101|5 |204.116.105.67 |40 |Sensor embedded in exhaust pipes in the ceilings|1574 |35.93|-85.46 |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
Instead of selecting each column by hand, you can try writing a generic UDF or helper to pull out all the individual columns.
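One way to make that generic without a UDF is a small helper that walks the DataFrame's schema and builds the column list itself. This is a sketch of my own, not part of the original answer; the `flattenSchema` name and the commented usage are assumptions that follow the `level1` alias used above:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively collect one Column per leaf field, flattening nested structs
// and naming each leaf by its dotted path (e.g. "geo.lat" -> "geo_lat").
def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
  schema.fields.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case st: StructType => flattenSchema(st, path)
      case _              => Seq(col(path).as(path.replace(".", "_")))
    }
  }

// Hypothetical usage with the exploded DataFrame from above:
// val exploded = df.select(col("dc_id"), explode(array("source.*")) as "level1")
// val level1Type = exploded.schema("level1").dataType.asInstanceOf[StructType]
// exploded.select(col("dc_id") +: flattenSchema(level1Type, "level1"): _*).show(false)
```

The helper only recurses into `StructType` fields; arrays and maps would need an `explode` first, as in the answers here.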
Note: tested with Spark 2.3.
Assuming the JSON string has been read into a variable called jsonString:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

// Parse the JSON string directly from a Dataset[String]
val df = spark.read.json(Seq(jsonString).toDS)
val df1 = df.withColumn("lat", explode(array("source.sensor-igauge.geo.lat")))
You can follow the same steps for other structures as well, such as map and array structures.
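As a concrete illustration of the map variant: if you supply an explicit schema that declares `source` as a `MapType` instead of letting Spark infer one struct field per sensor, `explode` works directly on the map and yields one row per sensor. This is a sketch of my own, tested against a trimmed-down inline version of the JSON; the schema fields and session setup are assumptions, not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("nested-json-map").getOrCreate()
import spark.implicits._

// A trimmed-down, single-line version of the document's JSON, inlined for brevity.
val jsonString = """{"dc_id":"dc-101","source":{"sensor-igauge":{"id":10,"temp":35,"geo":{"lat":38.0,"long":97.0}},"sensor-ipad":{"id":13,"temp":34,"geo":{"lat":47.41,"long":-122.0}}}}"""

// Declare "source" as a map from sensor name to a sensor struct, instead of
// letting Spark infer a struct with one field per sensor.
val sensorSchema = new StructType()
  .add("id", LongType)
  .add("temp", LongType)
  .add("geo", new StructType().add("lat", DoubleType).add("long", DoubleType))

val schema = new StructType()
  .add("dc_id", StringType)
  .add("source", MapType(StringType, sensorSchema))

val df = spark.read.schema(schema).json(Seq(jsonString).toDS)

// explode on a MapType column produces two columns, "key" (the sensor name)
// and "value" (the sensor struct) -- one row per map entry.
val flat = df
  .select($"dc_id", explode($"source"))
  .select($"dc_id", $"key" as "sensor", $"value.id", $"value.temp")

flat.show(false)
```

With the full file you would replace the inline string with `spark.read.schema(schema).option("multiline", true).json("data/test.json")`. The advantage over `array("source.*")` is that the sensor name survives as its own column.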
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("multiline", true).json("myfile.json")

// explode(array("source.*")) turns the four sensor structs into four rows;
// the exploded struct lands in a column named "col"
df.select($"dc_id", explode(array("source.*")))
  .select($"dc_id", $"col.c02_level", $"col.description", $"col.geo.lat", $"col.geo.long", $"col.id", $"col.ip", $"col.temp")
  .show(false)
Output:
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc_id |c02_level|description |lat |long |id |ip |temp|
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc-101|1475 |Sensor attached to the container ceilings |38.0 |97.0 |10 |68.28.91.22 |35 |
|dc-101|1346 |Sensor attached to the factory ceilings |33.61|-111.89|8 |208.109.163.218|40 |
|dc-101|1370 |Sensor ipad attached to carbon cylinders |47.41|-122.0 |13 |67.185.72.1 |34 |
|dc-101|1574 |Sensor embedded in exhaust pipes in the ceilings|35.93|-85.46 |5 |204.116.105.67 |40 |
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+