When I retrieve a dataset in Spark 2, using a select statement the underlying columns inherit the data types of the queried columns.
val ds1 = spark.sql("select 1 as a, 2 as b, 'abd' as c")
ds1.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
Now if I convert this into a case class, it will correctly convert the values, but the underlying schema is still wrong.
case class abc(a: Double, b: Double, c: String)
val ds2 = ds1.as[abc]
ds2.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
ds2.collect
res18: Array[abc] = Array(abc(1.0,2.0,abd))
I "SHOULD" be able to specify the encoder to use when I create the second dataset, but scala seems to ignore this parameter (Is this a BUG?):
val abc_enc = org.apache.spark.sql.Encoders.product[abc]
val ds2 = ds1.as[abc](abc_enc)
ds2.printSchema
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
So the only way I can see to do this simply, without very complex mapping is to use createDataset, but this requires a collect on the underlying object, so it's not ideal.
val ds2 = spark.createDataset(ds1.as[abc].collect)
This is an open issue in Spark API (check this ticket SPARK-17694 )
So what you need to do is doing an extra explicit cast. Something like this should work:
ds1.as[abc].map(x => x : abc)
You can simply use cast
method on columns
as
import sqlContext.implicits._
val ds2 = ds1.select($"a".cast(DoubleType), $"a".cast(DoubleType), $"c")
ds2.printSchema()
you should have
root
|-- a: double (nullable = false)
|-- a: double (nullable = false)
|-- c: string (nullable = false)
You could also cast the column while selecting with sql query as below
import spark.implicits._
val ds = Seq((1,2,"abc"),(1,2,"abc")).toDF("a", "b","c").createOrReplaceTempView("temp")
val ds1 = spark.sql("select cast(a as Double) , cast (b as Double), c from temp")
ds1.printSchema()
This have the schema as
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
Now you can convert to Dataset with case class
case class abc(a: Double, b: Double, c: String)
val ds2 = ds1.as[abc]
ds2.printSchema()
Which now has the required schema
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
Hope this helps!
OK, I think I've resolved this in a better way.
Instead of using a collect when we create a new dataset, we can just reference the rdd of the dataset.
So instead of
val ds2 = spark.createDataset(ds1.as[abc].collect)
We use:
val ds2 = spark.createDataset(ds1.as[abc].rdd)
ds2.printSchema
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
This keeps the lazy evaluation intact, but allows the new dataset to use the Encoder for the abc case class, and the subsequent schema will reflect this when we use it to create a new table.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.