
Spark Dataset - map Option[T] fields

I wonder how to work on Dataset columns that are nullable (Option[T]). My goal is to use the Spark Dataset API (such as map) and benefit from compile-time type checking. (I do not want to use the DataFrame API, such as select.)

Take this example: I'd like to apply a function to a column. This works fine only when the column is not nullable.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = List(
    StructField("name", StringType, false)
  , StructField("age", IntegerType, true)
  , StructField("children", IntegerType, false)
)

val data = Seq(
  Row("miguel", null, 0),
  Row("luisa", 21, 1)
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

case class Person(name: String, age: Option[Int], children: Int)
//                                    ^
//                                    |
//                                 age is nullable
df.as[Person].map(x => x.children * 12).show
//+-----+
//|value|
//+-----+
//|    0|
//|   12|
//+-----+
df.as[Person].map(x => x.age * 12).show
//<console>:36: error: value * is not a member of Option[Int]
//       df.as[Person].map(x => x.age * 12).show

Can anybody point me to an easy way to multiply this nullable age column by 12?

Thanks

Since it is an Option, you can transform it directly with map:

df.as[Person].map(x => x.age.map(_ * 12)).show

// +-----+
// |value|
// +-----+
// | null|
// |  252|
// +-----+
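Note that the result column is still nullable, because map over an Option yields another Option. If you want a non-nullable Int instead, you can chain getOrElse to substitute a default for the missing value. A sketch, assuming the same Person case class and df as above (the default 0 is an arbitrary choice for illustration):

```scala
// None.map(_ * 12) stays None; getOrElse(0) turns it into 0,
// so the resulting Dataset column is a plain Int.
df.as[Person].map(x => x.age.map(_ * 12).getOrElse(0)).show
```
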

In practice I'd just use select:

df.select(($"age" * 12).as[Int]).show
// +----------+
// |(age * 12)|
// +----------+
// |      null|
// |       252|
// +----------+

It will perform better, and once you call as[Person] followed by map you already lose most of the optimizer benefits anyway, since Spark cannot see inside the lambda.
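The Option patterns used above can be exercised in plain Scala, without Spark, which makes the semantics easy to check. A minimal sketch:

```scala
// An Option[Int] is either Some(value) or None; map applies the
// function only when a value is present, otherwise it stays None.
val age: Option[Int]     = Some(21)
val missing: Option[Int] = None

val a = age.map(_ * 12)                  // Some(252)
val b = missing.map(_ * 12)              // None
val c = missing.map(_ * 12).getOrElse(0) // 0

// Combining two Option fields: the for-comprehension yields None
// if either side is missing, mirroring SQL's null propagation.
val bonus: Option[Int] = Some(3)
val total = for { x <- age; y <- bonus } yield x + y // Some(24)
```

This is exactly what happens row by row inside the map over the typed Dataset.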
