Spark 2.0.1: split JSON Array Column into ArrayType(StringType)
I have a DataFrame like this:
root
|-- sum_id: long (nullable = true)
|-- json: string (nullable = true)
+-------+------------------------------+
|sum_id |json |
+-------+------------------------------+
|8124455|[{"itemId":11},{"itemId":12}] |
|8124457|[{"itemId":53}] |
|8124458|[{"itemId":11},{"itemId":33}] |
+-------+------------------------------+
and I would like to explode it into this, using Scala:
root
|-- sum_id: long (nullable = true)
|-- itemId: integer (nullable = true)
+-------+--------+
|sum_id |itemId |
+-------+--------+
|8124455|11 |
|8124455|12 |
|8124457|53 |
|8124458|11 |
|8124458|33 |
+-------+--------+
What I tried:
Using get_json_object, but the column is an array of JSON objects, so I think it should be exploded into objects first, but how?
Tried to cast the json column from StringType to ArrayType(StringType), but got data type mismatch exceptions.
Please guide me how to solve this problem.
The code below will do your work precisely.
import org.apache.spark.sql.functions.{col, explode, get_json_object, udf}

// get_json_object with the "$[*].itemId" path returns a string like "[11,12]";
// this UDF strips the brackets and splits it into an array of strings.
val toItemArr = udf((jsonArrStr: String) => {
  jsonArrStr.replace("[", "").replace("]", "").split(",")
})

inputDataFrame
  .withColumn("itemId", explode(toItemArr(get_json_object(col("json"), "$[*].itemId"))))
  .drop("json")
  .show
+-------+------+
| sum_id|itemId|
+-------+------+
|8124455| 11|
|8124455| 12|
|8124457| 53|
|8124458| 11|
|8124458| 33|
+-------+------+
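Because the UDF body is plain string manipulation, it can be sanity-checked outside Spark. A minimal sketch, assuming (as the path expression suggests) that get_json_object(col("json"), "$[*].itemId") yields the string "[11,12]" for the first row:

```scala
// Pure-Scala check of the UDF body above; "[11,12]" is a hard-coded stand-in
// for what get_json_object is assumed to return for the first row.
val toItemArr = (jsonArrStr: String) =>
  jsonArrStr.replace("[", "").replace("]", "").split(",")

val parts = toItemArr("[11,12]")
println(parts.mkString(","))  // 11,12
```

Note the resulting values are strings; if you need the integer schema shown in the question, cast the column afterwards, e.g. with .withColumn("itemId", col("itemId").cast("int")).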
As you are using JSON, this might be the best approach:
Please take a look:
import org.apache.spark._
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.DeserializationFeature
val df = sc.parallelize(Seq((8124455,"""[{"itemId":11},{"itemId":12}]"""),(8124457,"""[{"itemId":53}]"""),(8124458,"""[{"itemId":11},{"itemId":33}]"""))).toDF("sum_id","json")
val result = df.rdd.mapPartitions(records => {
  // Build one ObjectMapper per partition; it is not serializable across tasks
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  val values = records.flatMap(record => {
    try {
      // Parse the JSON string into List[Map[String, Int]] and keep only the itemId values
      Some((record.getInt(0), mapper.readValue(record.getString(1), classOf[List[Map[String, Int]]]).map(_.map(_._2).toList).flatten))
    } catch {
      case e: Exception => None
    }
  })
  // Pair each itemId with its sum_id
  values.flatMap(listOfList => listOfList._2.map(a => (listOfList._1, a)))
}, true)
result.toDF("sum_id", "itemId").show()
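To see what the mapPartitions body does to a single row, the reshaping after mapper.readValue can be traced with plain Scala collections. A sketch with a hand-written parsed value (no Jackson or Spark needed; the parsed shape is an assumption matching classOf[List[Map[String,Int]]]):

```scala
// Assumed result of mapper.readValue(...) for the first row:
val parsed: List[Map[String, Int]] = List(Map("itemId" -> 11), Map("itemId" -> 12))
val sumId = 8124455

// First flatten step: collapse each one-entry map to its values
val itemIds = parsed.map(_.map(_._2).toList).flatten

// Final flatMap step: pair every itemId with its sum_id
val rows = itemIds.map(itemId => (sumId, itemId))
println(rows)  // List((8124455,11), (8124455,12))
```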