Cast schema of a data frame in Spark and Scala

I want to cast the schema of a dataframe to change the type of some columns using Spark and Scala.

Specifically, I am trying to use the as[U] function, whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U".

In principle this is exactly what I want, but I cannot get it to work.

Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala



    // definition of data
    val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

As expected, the schema of data is:

root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

I would like to cast the column "b" to Double. So I try the following:



    import session.implicits._;

    println(" --------------------------- Casting using (String Double)")

    val data_TupleCast = data.as[(String, Double)]
    data_TupleCast.show()
    data_TupleCast.printSchema()

    println(" --------------------------- Casting using ClassData_Double")

    case class ClassData_Double(a: String, b: Double)

    val data_ClassCast = data.as[ClassData_Double]
    data_ClassCast.show()
    data_ClassCast.printSchema()

As I understand the definition of as[U], the new DataFrames should have the following schema:

root
     |-- a: string (nullable = true)
     |-- b: double (nullable = false)

But the output is:

    --------------------------- Casting using (String Double)
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

    --------------------------- Casting using ClassData_Double
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

which shows that column "b" has not been cast to double.

Any hints on what I am doing wrong?

BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?" (see How to change column types in Spark SQL's DataFrame?). I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).

Well, since functions are chained and Spark does lazy evaluation, it actually does change the schema of the whole data in one shot, even if you do write it as changing one column at a time, like this:

    import org.apache.spark.sql.types.{DoubleType, StringType}
    import spark.implicits._

    df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
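
Applied to the DataFrame from the question, a minimal sketch of this chained-cast approach (assuming the SparkSession is named spark) might look like:

    import org.apache.spark.sql.types.DoubleType
    import spark.implicits._

    val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

    // Cast column "b" from integer to double; nothing is executed until an
    // action runs, so the whole schema change is resolved in one shot.
    val casted = data.withColumn("b", $"b".cast(DoubleType))

    casted.printSchema()   // column "b" now shows up as double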

As an alternative, I'm thinking you could use map to do your cast in one go, like:

    df.map{t => (t._1, t._2.asInstanceOf[Double], t._3.asInstanceOf[...], ...)}
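
For the two-column example above, a runnable sketch of that idea (typing the frame as a tuple Dataset first, and again assuming a SparkSession named spark) could be:

    import spark.implicits._

    val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

    // Map every record to the target tuple type in one pass;
    // toDouble converts column "b" from Int to Double.
    val casted = data.as[(String, Int)]
      .map { case (a, b) => (a, b.toDouble) }
      .toDF("a", "b")

    casted.printSchema()   // column "b" is now double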
