
Scala Spark: Dataset with JSON columns

Hello from a Spark beginner!

I have a DataFrame that includes several columns, let's say ID, name, and properties. All of them are of type string. The last column, properties, includes a JSON representation of some properties of the object.

I am looking for a way to iterate over the DataFrame, parse the JSON, extract a specific field from each item, and append it to the row as a new column.

So far I'm a bit lost. I know that Spark can import JSON datasets (which isn't what I have), and that there is a net.liftweb.json library, but I haven't found a way to make it work:

val users = sqlContext.table("user")
  .withColumn("parsedProperties", parse($"properties"))

This fails with a type mismatch: the parse() function expects a String, and I'm passing it a Column.

Note that I do NOT have a set schema for this JSON column.

Thank you in advance!

You need to create a UDF here from the parse function, and then apply that UDF to the column.

import org.apache.spark.sql.functions.udf
import net.liftweb.json._

val parse_udf = udf(parse _)

val users = sqlContext.table("user")
  .withColumn("parsedProperties", parse_udf($"properties"))

Working now! Thank you!

import org.apache.spark.sql.functions.udf
import net.liftweb.json._

val getEmail: String => String = parse(_).asInstanceOf[JObject].values.getOrElse("email", "").toString
val getEmailUDF = udf(getEmail)
val users = sqlContext.table("user")
  .withColumn("email", getEmailUDF($"properties"))
