Spark dataframe map root key with elements of array of another column of string type
I've run into a problem: I have a DataFrame with 2 columns and the following schema
scala> df1.printSchema
root
|-- actions: string (nullable = true)
|-- event_id: string (nullable = true)
The actions column actually contains an array of objects, but its type is string, so I can't use explode on it directly.
Sample data:
------------------------------------------------------------------------------------------------------------------
| event_id | actions |
------------------------------------------------------------------------------------------------------------------
| 1 | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}] |
------------------------------------------------------------------------------------------------------------------
Each action object also has some other keys, but for simplicity I've kept only 2 here.
I want to transform it into the following format
OUTPUT:-
---------------------------------------
| event_id | name | score |
---------------------------------------
| 1 | Vijay | 843 |
---------------------------------------
| 1 | Manish | 840 |
---------------------------------------
| 1 | Mayur | 930 |
---------------------------------------
How can I achieve this with a Spark DataFrame?
I tried reading the actions column with
val df2= spark.read.option("multiline",true).json(df1.rdd.map(row => row.getAs[String]("actions")))
but with this approach I can no longer map the event_id to each resulting row.
You can use the from_json function for this. It takes 2 inputs: the column holding the JSON string, and the schema to parse it into.
That looks like this:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")
df.show(false)
+--------+---------------------------------------------------------------------------------------------------+
|event_id|actions |
+--------+---------------------------------------------------------------------------------------------------+
|1 |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|
+--------+---------------------------------------------------------------------------------------------------+
// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
  new StructType()
    .add("name", StringType)
    .add("score", IntegerType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), actionsSchema)))
  .drop("actions")
  .select("event_id", "parsedActions.*")
parsedDf.show(false)
+--------+------+-----+
|event_id| name|score|
+--------+------+-----+
| 1| Vijay| 843|
| 1|Manish| 840|
| 1| Mayur| 930|
+--------+------+-----+
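For intuition, the from_json + explode pipeline above does the following per input row: parse the actions string into an array of structs, then emit one output row per array element, carrying event_id along. Here is a plain-Scala sketch of that flattening (no Spark; the regex parser is a hypothetical simplification that only handles this sample's exact shape, whereas from_json is a real JSON parser):

```scala
// Flatten one (event_id, actions-json) pair into (event_id, name, score) rows,
// mimicking what from_json + explode do inside Spark.
// NOTE: the regex only matches the simple {"name": ..., "score": ...} objects
// in this sample; it is not a general JSON parser.
def flatten(eventId: String, actionsJson: String): List[(String, String, Int)] = {
  val obj = """\{\s*"name":\s*"([^"]+)",\s*"score":\s*(\d+)\s*\}""".r
  obj.findAllMatchIn(actionsJson)
    .map(m => (eventId, m.group(1), m.group(2).toInt))
    .toList
}

val rows = flatten(
  "1",
  """[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]"""
)
rows.foreach(println) // one tuple per action, e.g. (1,Vijay,843)
```

Each output tuple keeps the parent event_id, which is exactly why doing the parsing inside the DataFrame (rather than re-reading the column as a separate JSON dataset) preserves the mapping the question asks about.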
Hope this helps!
Disclaimer: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.