Spark: Recursive 'ArrayType Column => ArrayType Column' function
Spark 2.0.1: split JSON Array Column into ArrayType(StringType)
I have a dataframe like this:
root
|-- sum_id: long (nullable = true)
|-- json: string (nullable = true)
+-------+------------------------------+
|sum_id |json |
+-------+------------------------------+
|8124455|[{"itemId":11},{"itemId":12}] |
|8124457|[{"itemId":53}] |
|8124458|[{"itemId":11},{"itemId":33}] |
+-------+------------------------------+
I want to explode it with Scala into:
root
|-- sum_id: long (nullable = true)
|-- itemId: integer (nullable = true)
+-------+--------+
|sum_id |itemId |
+-------+--------+
|8124455|11 |
|8124455|12 |
|8124457|53 |
|8124458|11 |
|8124458|33 |
+-------+--------+
What I have tried:

Using get_json_object, but the column is an array of JSON objects, so I think it needs to be exploded into individual objects first. But how?

Casting the column json from StringType to ArrayType(StringType), but that throws a data type mismatch exception.

Please guide me on how to solve this.
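For reference, newer Spark versions make this possible without a UDF: from_json accepts an ArrayType schema from Spark 2.2 onward (the question targets 2.0.1, so this is only an option after upgrading). A hedged sketch, assuming a DataFrame df shaped like the one above:

```scala
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Schema of the json column: an array of {"itemId": <int>} objects
val itemSchema = ArrayType(StructType(Seq(StructField("itemId", IntegerType))))

val exploded = df
  .withColumn("item", explode(from_json(col("json"), itemSchema)))
  .select(col("sum_id"), col("item.itemId"))

exploded.show()
```

Unlike the string-splitting UDF below, this keeps itemId as a proper integer column.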
The code below does exactly what you want.
import org.apache.spark.sql.functions.{col, explode, get_json_object, udf}

val toItemArr = udf((jsonArrStr: String) => {
  jsonArrStr.replace("[", "").replace("]", "").split(",")
})

inputDataFrame
  .withColumn("itemId", explode(toItemArr(get_json_object(col("json"), "$[*].itemId"))))
  .drop("json")
  .show
+-------+------+
| sum_id|itemId|
+-------+------+
|8124455|    11|
|8124455|    12|
|8124457|    53|
|8124458|    11|
|8124458|    33|
+-------+------+
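To see why the UDF is needed: get_json_object with the path "$[*].itemId" collapses the whole array into a single string like "[11,12]", and the UDF then strips the brackets and splits on commas. A sketch of that post-processing step on the first row:

```scala
// get_json_object(col("json"), "$[*].itemId") yields the string "[11,12]"
// for the first row; the UDF turns it into an array of tokens:
val tokens = "[11,12]".replace("[", "").replace("]", "").split(",")
// tokens is Array("11", "12")
```

Note that the resulting itemId column holds strings, not integers; cast it afterwards if you need the integer schema shown in the question.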
Since you are working with JSON, this is probably the best approach:

Take a look at this:
import org.apache.spark._
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}

val df = sc.parallelize(Seq(
  (8124455, """[{"itemId":11},{"itemId":12}]"""),
  (8124457, """[{"itemId":53}]"""),
  (8124458, """[{"itemId":11},{"itemId":33}]""")
)).toDF("sum_id", "json")

val result = df.rdd.mapPartitions(records => {
  // Build one ObjectMapper per partition: mappers are costly to create
  // and not serializable, so they must be constructed inside mapPartitions.
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)

  // Parse each row's json column into List[Map[String, Int]] and collect
  // the itemId values; rows that fail to parse are silently dropped.
  val values = records.flatMap(record => {
    try {
      Some((record.getInt(0),
        mapper.readValue(record.getString(1), classOf[List[Map[String, Int]]])
          .flatMap(_.values)))
    } catch {
      case e: Exception => None
    }
  })

  // Pair each itemId with its sum_id: (sum_id, List(ids)) => (sum_id, id)
  values.flatMap { case (sumId, ids) => ids.map(id => (sumId, id)) }
}, true)

result.toDF("sum_id", "itemId").show()
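The Jackson part of this answer can be sanity-checked outside Spark. A minimal sketch, assuming jackson-module-scala is on the classpath:

```scala
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper

val mapper = new ObjectMapper with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)

// Parse one json cell and extract the itemId values
val ids = mapper
  .readValue("""[{"itemId":11},{"itemId":12}]""", classOf[List[Map[String, Int]]])
  .flatMap(_.values)
// ids == List(11, 12)
```

This is the same readValue call the mapPartitions body makes per row, so it is a convenient way to debug malformed input before running the job.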