
How do I change the schema on a Spark Dataset

When I retrieve a dataset in Spark 2 using a select statement, the underlying columns inherit the data types of the queried columns.

val ds1 = spark.sql("select 1 as a, 2 as b, 'abd' as c")

ds1.printSchema()
root
 |-- a: integer (nullable = false)
 |-- b: integer (nullable = false)
 |-- c: string (nullable = false)

Now if I convert this into a case class, it will correctly convert the values, but the underlying schema is still wrong.

case class abc(a: Double, b: Double, c: String)
val ds2 = ds1.as[abc]
ds2.printSchema()
root
 |-- a: integer (nullable = false)
 |-- b: integer (nullable = false)
 |-- c: string (nullable = false)

ds2.collect
res18: Array[abc] = Array(abc(1.0,2.0,abd))

I "SHOULD" be able to specify the encoder to use when I create the second dataset, but Scala seems to ignore this parameter (Is this a BUG?):

val abc_enc = org.apache.spark.sql.Encoders.product[abc]

val ds2 = ds1.as[abc](abc_enc)

ds2.printSchema
root
 |-- a: integer (nullable = false)
 |-- b: integer (nullable = false)
 |-- c: string (nullable = false)
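
For comparison, the encoder itself does carry the expected schema even though as[] appears to ignore it (a small illustrative check, reusing the abc_enc defined above):

abc_enc.schema.printTreeString()   // should print a and b as double and c as string

which makes the mismatch with ds2.printSchema above easy to see.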

So the only way I can see to do this simply, without very complex mapping, is to use createDataset, but that requires a collect on the underlying object, so it's not ideal.

val ds2 = spark.createDataset(ds1.as[abc].collect)

This is an open issue in the Spark API (see ticket SPARK-17694).

So what you need to do is an extra explicit cast. Something like this should work:

ds1.as[abc].map(x => x : abc)
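
A minimal, self-contained sketch of that approach (ds3 is just an illustrative name; it reuses the ds1 and abc definitions from the question and assumes a running SparkSession such as the spark-shell provides):

import spark.implicits._

case class abc(a: Double, b: Double, c: String)

val ds1 = spark.sql("select 1 as a, 2 as b, 'abd' as c")

// The identity map forces every row through the abc encoder,
// so the resulting Dataset carries the case-class schema.
val ds3 = ds1.as[abc].map(x => x: abc)

ds3.printSchema()

This should now report a and b as double and c as string, matching the abc encoder rather than the original integer columns.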

You can simply use the cast method on the columns, as follows:

import spark.implicits._
import org.apache.spark.sql.types.DoubleType

val ds2 = ds1.select($"a".cast(DoubleType), $"b".cast(DoubleType), $"c")
ds2.printSchema()

you should have

root
 |-- a: double (nullable = false)
 |-- b: double (nullable = false)
 |-- c: string (nullable = false)
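
You can then convert the casted result to a typed Dataset in the usual way (a short usage sketch, assuming the abc case class from the question):

val ds3 = ds2.as[abc]   // column types and the case class now agree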

You could also cast the columns while selecting with a SQL query, as below:

import spark.implicits._

Seq((1, 2, "abc"), (1, 2, "abc")).toDF("a", "b", "c").createOrReplaceTempView("temp")

val ds1 = spark.sql("select cast(a as Double), cast(b as Double), c from temp")

ds1.printSchema()

This has the schema:

root
 |-- a: double (nullable = false)
 |-- b: double (nullable = false)
 |-- c: string (nullable = true)

Now you can convert it to a Dataset with the case class:

case class abc(a: Double, b: Double, c: String)

val ds2 = ds1.as[abc]
ds2.printSchema()

Which now has the required schema

root
 |-- a: double (nullable = false)
 |-- b: double (nullable = false)
 |-- c: string (nullable = true)

Hope this helps!

OK, I think I've resolved this in a better way.

Instead of using a collect when we create the new dataset, we can just reference the underlying RDD of the dataset.

So instead of

val ds2 = spark.createDataset(ds1.as[abc].collect)

We use:

val ds2 = spark.createDataset(ds1.as[abc].rdd)

ds2.printSchema
root
 |-- a: double (nullable = false)
 |-- b: double (nullable = false)
 |-- c: string (nullable = true)

This keeps the lazy evaluation intact, but allows the new dataset to use the Encoder for the abc case class, and the subsequent schema will reflect this when we use it to create a new table.
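
For example (a minimal sketch; the table name abc_table is only an illustration, and this assumes a SparkSession with catalog support such as the spark-shell provides):

// Persist the re-encoded dataset and read the schema back from the catalog.
ds2.write.mode("overwrite").saveAsTable("abc_table")
spark.table("abc_table").printSchema()

The reported schema should show a and b as double, matching the abc encoder rather than the original integer columns.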
