
Spark dataframe: map a root key to the elements of an array stored in another column of string type

I'm stuck on a problem where I have a dataframe with 2 columns and the following schema:

    scala> df1.printSchema
    root
     |-- actions: string (nullable = true)
     |-- event_id: string (nullable = true)

The actions column actually contains an array of objects, but its type is string, so I can't use explode on it directly.

Sample data:

------------------------------------------------------------------------------------------------------------------
| event_id |                                        actions                                                       |
------------------------------------------------------------------------------------------------------------------
|    1     | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]  |
------------------------------------------------------------------------------------------------------------------

There are some other keys present in each object of actions, but for simplicity I have taken 2 here.

I want to convert this to the format below.

Output:

---------------------------------------
| event_id | name      |    score      |
---------------------------------------
|   1      | Vijay     |    843        |
---------------------------------------
|   2      | Manish    |    840        |
---------------------------------------
|   3      | Mayur     |    930        |
---------------------------------------

How can I do this with a Spark dataframe?

I have tried to read the actions column using

    val df2 = spark.read.option("multiline", true).json(df1.rdd.map(row => row.getAs[String]("actions")))

but this way I can't map event_id to each resulting row.

You can do this with the from_json function. It takes 2 inputs:

  • The column containing the JSON string to parse
  • The schema to parse that JSON string with

That would look something like this:

import spark.implicits._
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")

df.show(false)
+--------+---------------------------------------------------------------------------------------------------+
|event_id|actions                                                                                            |
+--------+---------------------------------------------------------------------------------------------------+
|1       |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|
+--------+---------------------------------------------------------------------------------------------------+


// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
  new StructType()
    .add("name", StringType)
    .add("score", IntegerType)
  )

// Parse the JSON string with our schema, explode the resulting array so that
// each JSON object gets its own row, then select the wanted columns,
// unwrapping the parsedActions struct into separate columns.
// The select already leaves out the original "actions" column.
val parsedDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), actionsSchema)))
  .select("event_id", "parsedActions.*")

parsedDf.show(false)
+--------+------+-----+
|event_id|  name|score|
+--------+------+-----+
|       1| Vijay|  843|
|       1|Manish|  840|
|       1| Mayur|  930|
+--------+------+-----+
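
If you want to try this without creating the CSV file first, here is a minimal sketch that builds the sample row from the question inline. The local SparkSession settings, the app name and the variable names are my own assumptions for illustration. In spark-shell a session named spark already exists, so getOrCreate simply returns it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructType}

// Assumption: a local session for experimenting; the app name is arbitrary
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-json-actions")
  .getOrCreate()
import spark.implicits._

// The sample row from the question, built inline instead of read from a file
val df = Seq(
  ("1", """[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]""")
).toDF("event_id", "actions")

// Same schema as above: an array of {name, score} structs
val actionsSchema = ArrayType(
  new StructType()
    .add("name", StringType)
    .add("score", IntegerType)
)

// Parse the string, explode to one row per array element, flatten the struct
val parsedDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), actionsSchema)))
  .select("event_id", "parsedActions.*")

parsedDf.show(false)
```

Note that event_id is repeated on every exploded row, because each row inherits the id of the event it came from.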

Hope this helps!
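
One more note, since the question mentions that each object has more keys than the two shown: instead of listing every field by hand, you could let Spark infer the schema from one sample value with schema_of_json. This is a different technique from the explicit schema above and only a sketch. It assumes Spark 2.4 or later, that every row has the same shape as the sampled one, and that the sampled value is not null. The names sampleJson and inferredDf are made up for illustration. Inferred integer fields come back as long rather than int.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json, schema_of_json}

// Assumption: a local session for experimenting; the app name is arbitrary
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("infer-actions-schema")
  .getOrCreate()
import spark.implicits._

// The sample row from the question
val df = Seq(
  ("1", """[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]""")
).toDF("event_id", "actions")

// Take one representative JSON string to infer the schema from
val sampleJson = df.select("actions").as[String].head()

// schema_of_json turns the sample into a schema column that from_json accepts
val inferredDf = df
  .withColumn("parsedActions", explode(from_json(col("actions"), schema_of_json(sampleJson))))
  .select("event_id", "parsedActions.*")

inferredDf.printSchema()
inferredDf.show(false)
```

If some objects can be missing keys or have differing types, an explicit schema like the one above is the safer choice, because the inferred one only reflects the sampled value.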

