I'm wondering how to work on Dataset columns that are nullable (Option[T]). My goal is to use the Spark Dataset API (such as map) and benefit from its compile-time type checking. (I do not want to use the DataFrame API, such as select.)
Take this example: I'd like to apply a function to a column. This only works fine when the column is not nullable.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = List(
  StructField("name", StringType, false),
  StructField("age", IntegerType, true),
  StructField("children", IntegerType, false)
)
val data = Seq(
  Row("miguel", null, 0),
  Row("luisa", 21, 1)
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
case class Person(name: String, age: Option[Int], children: Int)
// ^
// |
// age is nullable
df.as[Person].map(x => x.children * 12).show
//+-----+
//|value|
//+-----+
//| 0|
//| 12|
//+-----+
df.as[Person].map(x => x.age * 12).show
//<console>:36: error: value * is not a member of Option[Int]
// df.as[Person].map(x => x.age * 12).show
Can anybody point me to an easy way to multiply this nullable age column by 12?
Thanks
Since it is an Option, you can transform it directly with its own map:
df.as[Person].map(x => x.age.map(_ * 12)).show
// +-----+
// |value|
// +-----+
// | null|
// | 252|
// +-----+
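For reference, the Option semantics at work here can be sketched in plain Scala (no Spark session needed); the values ages, scaled, and defaulted are just illustrative names, not part of the question's code:

```scala
// Plain-Scala sketch of what the Dataset's map does per row:
// Option.map keeps None as None and applies the function inside Some.
val ages: Seq[Option[Int]] = Seq(None, Some(21))

val scaled: Seq[Option[Int]] = ages.map(_.map(_ * 12))
// scaled == Seq(None, Some(252))

// If you prefer a plain Int with a default instead of an Option:
val defaulted: Seq[Int] = ages.map(_.map(_ * 12).getOrElse(0))
// defaulted == Seq(0, 252)
```

The same getOrElse (or fold) trick works inside the Dataset map if you want the missing ages replaced by a default rather than kept as null.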
In practice I'd just select
:
df.select(($"age" * 12).as[Int]).show
// +----------+
// |(age * 12)|
// +----------+
// | null|
// | 252|
// +----------+
It will perform better, and when you call as[Person] you already lose most of the static type-checking benefits anyway.