Spark dataframe map root key with elements of array of another column of string type
I've run into a problem: I have a DataFrame with 2 columns and the following schema
scala> df1.printSchema
root
|-- actions: string (nullable = true)
|-- event_id: string (nullable = true)
The actions column actually contains an array of objects, but its type is string, so I can't use explode on it directly.
Sample data:
------------------------------------------------------------------------------------------------------------------
| event_id | actions |
------------------------------------------------------------------------------------------------------------------
| 1 | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}] |
------------------------------------------------------------------------------------------------------------------
Each action object also has some other keys, but for simplicity I've kept only 2 here.
I want to transform it into the following format
OUTPUT:-
---------------------------------------
| event_id | name | score |
---------------------------------------
| 1 | Vijay | 843 |
---------------------------------------
| 1 | Manish | 840 |
---------------------------------------
| 1 | Mayur | 930 |
---------------------------------------
How can I achieve this with a Spark DataFrame?
I tried reading the actions column with
val df2= spark.read.option("multiline",true).json(df1.rdd.map(row => row.getAs[String]("actions")))
but with this approach I can no longer map the event_id to each resulting row.
You can use the from_json function for this. It takes 2 inputs: the column holding the JSON string, and the schema to parse it into.
That looks like this:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")
df.show(false)
+--------+---------------------------------------------------------------------------------------------------+
|event_id|actions |
+--------+---------------------------------------------------------------------------------------------------+
|1 |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|
+--------+---------------------------------------------------------------------------------------------------+
// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
  new StructType()
    .add("name", StringType)
    .add("score", IntegerType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), actionsSchema)))
  .drop("actions")
  .select("event_id", "parsedActions.*")
parsedDf.show(false)
+--------+------+-----+
|event_id| name|score|
+--------+------+-----+
| 1| Vijay| 843|
| 1|Manish| 840|
| 1| Mayur| 930|
+--------+------+-----+
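For intuition, the from_json + explode pipeline above does the following per input row: parse the actions string into an array of structs, then emit one output row per array element, carrying event_id along. Here is a plain-Scala sketch of that flattening (no Spark; the regex parser is a hypothetical simplification that only handles this sample's exact shape, whereas from_json is a real JSON parser):

```scala
// Flatten one (event_id, actions-json) pair into (event_id, name, score) rows,
// mimicking what from_json + explode do inside Spark.
// NOTE: the regex only matches the simple {"name": ..., "score": ...} objects
// in this sample; it is not a general JSON parser.
def flatten(eventId: String, actionsJson: String): List[(String, String, Int)] = {
  val obj = """\{\s*"name":\s*"([^"]+)",\s*"score":\s*(\d+)\s*\}""".r
  obj.findAllMatchIn(actionsJson)
    .map(m => (eventId, m.group(1), m.group(2).toInt))
    .toList
}

val rows = flatten(
  "1",
  """[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]"""
)
rows.foreach(println) // one tuple per action, e.g. (1,Vijay,843)
```

Each output tuple keeps the parent event_id, which is exactly why doing the parsing inside the DataFrame (rather than re-reading the column as a separate JSON dataset) preserves the mapping the question asks about.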
Hope this helps!
Disclaimer: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.