
Spark 2.0.1: split JSON Array Column into ArrayType(StringType)

I have a DataFrame like this:

root
 |-- sum_id: long (nullable = true)
 |-- json: string (nullable = true)

+-------+------------------------------+
|sum_id |json                          |
+-------+------------------------------+
|8124455|[{"itemId":11},{"itemId":12}] |
|8124457|[{"itemId":53}]               |
|8124458|[{"itemId":11},{"itemId":33}] |
+-------+------------------------------+

and I would like to explode it into this with Scala:

root
 |-- sum_id: long (nullable = true)
 |-- itemId: integer (nullable = true)

+-------+--------+
|sum_id |itemId  |
+-------+--------+
|8124455|11      |
|8124455|12      |
|8124457|53      |
|8124458|11      |
|8124458|33      |
+-------+--------+

What I tried:

  1. Using get_json_object, but the column is an array of JSON objects, so I think it should be exploded into individual objects first. But how?

  2. Tried to cast the json column from StringType to ArrayType(StringType), but got a data type mismatch exception (a minimal repro is sketched below).
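
For reference, here is a minimal sketch of the failing cast from attempt 2 (assuming the DataFrame above is named df); Spark refuses to cast a plain string column directly to an array type:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StringType}

// Throws an AnalysisException ("... due to data type mismatch"), since
// StringType cannot be cast to ArrayType(StringType).
df.withColumn("json", col("json").cast(ArrayType(StringType)))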

Please guide me on how to solve this problem.

The code below will do your work precisely.

import org.apache.spark.sql.functions.{col, explode, get_json_object, udf}

// get_json_object(col("json"), "$[*].itemId") returns all ids of a row as a
// single JSON string such as "[11,12]"; this UDF strips the brackets and
// splits it into an Array[String].
val toItemArr = udf((jsonArrStr: String) => {
  jsonArrStr.replace("[", "").replace("]", "").split(",")
})

inputDataFrame
  .withColumn("itemId", explode(toItemArr(get_json_object(col("json"), "$[*].itemId"))))
  .drop("json")
  .show


+-------+------+
| sum_id|itemId|
+-------+------+ 
|8124455|    11|
|8124455|    12|
|8124457|    53|
|8124458|    11|
|8124458|    33|
+-------+------+
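
To see why this works, here is a sketch of the intermediate step (the column name ids is only for illustration): get_json_object with the JSONPath $[*].itemId collects the itemId values of a row into a single string, which the UDF then splits.

inputDataFrame
  .withColumn("ids", get_json_object(col("json"), "$[*].itemId"))
  .show(false)

// For the first row, ids is the string "[11,12]"; toItemArr turns it into
// Array("11", "12"), and explode then produces one row per element.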

As you are using JSON, this might be the best approach:

Please take a look:

// Runs as-is in spark-shell, where `spark` and `sc` are predefined and
// spark.implicits._ (needed for toDF) is already imported.
import org.apache.spark._
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}

val df = sc.parallelize(Seq(
  (8124455, """[{"itemId":11},{"itemId":12}]"""),
  (8124457, """[{"itemId":53}]"""),
  (8124458, """[{"itemId":11},{"itemId":33}]""")
)).toDF("sum_id", "json")

val result = df.rdd.mapPartitions(records => {
  // Build one Jackson mapper per partition: ObjectMapper is costly to create
  // and not serializable, so it cannot be built once on the driver and shipped.
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)

  // Parse each json string into List[Map[String, Int]] and collect the
  // itemId values; rows that fail to parse are silently dropped.
  val values = records.flatMap(record => {
    try {
      Some((record.getInt(0),
        mapper.readValue[List[Map[String, Int]]](record.getString(1))
          .flatMap(_.values)))
    } catch {
      case e: Exception => None
    }
  })

  // Emit one (sum_id, itemId) pair per extracted id.
  values.flatMap { case (sumId, itemIds) => itemIds.map(itemId => (sumId, itemId)) }
}, preservesPartitioning = true)

result.toDF("sum_id", "itemId").show()
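
With the column names passed to toDF, this prints the desired result:

+-------+------+
| sum_id|itemId|
+-------+------+
|8124455|    11|
|8124455|    12|
|8124457|    53|
|8124458|    11|
|8124458|    33|
+-------+------+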
