
Spark Dataset - map Option[T] fields

I wonder how to work on Dataset columns that are nullable (Option[T]). My goal is to use the Spark Dataset API (such as map) and benefit from compile-time type checking. (I do not want to use the DataFrame API, such as select.)

Take this example: I'd like to apply a function to a column. This works fine only when the column is not nullable.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = List(
    StructField("name", StringType, false)
  , StructField("age", IntegerType, true)
  , StructField("children", IntegerType, false)
)

val data = Seq(
  Row("miguel", null, 0),
  Row("luisa", 21, 1)
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

case class Person(name: String, age: Option[Int], children: Int)
//                                    ^
//                                    |
//                                 age is nullable
df.as[Person].map(x => x.children * 12).show
//+-----+
//|value|
//+-----+
//|    0|
//|   12|
//+-----+
df.as[Person].map(x => x.age * 12).show
//<console>:36: error: value * is not a member of Option[Int]
//       df.as[Person].map(x => x.age * 12).show

Can anybody point me to an easy way to multiply this nullable age column by 12?

Thanks

Since it is an Option, you can transform it directly with map:

df.as[Person].map(x => x.age.map(_ * 12)).show

// +-----+
// |value|
// +-----+
// | null|
// |  252|
// +-----+
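Note that the result column is still nullable, because map over an Option yields another Option. If you want a non-nullable Int instead, you can chain getOrElse to substitute a default for the missing value. A sketch, assuming the same Person case class and df as above (the default 0 is an arbitrary choice for illustration):

```scala
// None.map(_ * 12) stays None; getOrElse(0) turns it into 0,
// so the resulting Dataset column is a plain Int.
df.as[Person].map(x => x.age.map(_ * 12).getOrElse(0)).show
```
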

In practice I'd just use select:

df.select(($"age" * 12).as[Int]).show
// +----------+
// |(age * 12)|
// +----------+
// |      null|
// |       252|
// +----------+

It will perform better, and once you call as[Person] followed by map you already lose most of the optimizer benefits anyway, since Spark cannot see inside the lambda.
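The Option patterns used above can be exercised in plain Scala, without Spark, which makes the semantics easy to check. A minimal sketch:

```scala
// An Option[Int] is either Some(value) or None; map applies the
// function only when a value is present, otherwise it stays None.
val age: Option[Int]     = Some(21)
val missing: Option[Int] = None

val a = age.map(_ * 12)                  // Some(252)
val b = missing.map(_ * 12)              // None
val c = missing.map(_ * 12).getOrElse(0) // 0

// Combining two Option fields: the for-comprehension yields None
// if either side is missing, mirroring SQL's null propagation.
val bonus: Option[Int] = Some(3)
val total = for { x <- age; y <- bonus } yield x + y // Some(24)
```

This is exactly what happens row by row inside the map over the typed Dataset.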
