
How to map struct in DataFrame to case class?

At some point in my application, I have a DataFrame with a Struct field created from a case class. Now I want to cast/map it back to the case class type:

import spark.implicits._
case class Location(lat: Double, lon: Double)

scala> Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
res25: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<lat: double, lon: double>]

scala> res25.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

And the basic approach:

res25.map(r => {
   Location(r.getStruct(1).getDouble(0), r.getStruct(1).getDouble(1))
}).show(1)

Looks really dirty. Is there any simpler way?

In Spark 1.6+, if you want to retain the type information, use Dataset (DS), not DataFrame (DF).

import spark.implicits._
case class Location(lat: Double, lon: Double)

scala> Seq((10, Location(35, 25)), (20, Location(45, 35))).toDS
res25: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]

scala> res25.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)

It will give you Dataset[(Int, Location)]. Now, if you want to get back to its case class origin again, simply do this:

scala> res25.map(r => r._2).show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+
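
Since the element type is (Int, Location), you can also use the case class fields directly in other typed operations; a minimal sketch (struct rendering as in Spark 2.x):

scala> res25.filter(_._2.lat > 40.0).show()
+---+-----------+
| _1|         _2|
+---+-----------+
| 20|[45.0,35.0]|
+---+-----------+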

But if you want to stick to the DataFrame API, due to its dynamically typed nature, then you have to code it like this:

scala> res25.select("_2.*").map(r => Location(r.getDouble(0), r.getDouble(1))).show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+
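
A slightly cleaner variant of the same thing (a sketch; it relies on the projected column names lat and lon matching the case class fields) lets an implicit Encoder do the mapping instead of positional getDouble calls:

scala> res25.select("_2.*").as[Location].show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+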

You could also use the extractor pattern on Row, which would give you similar results, using more idiomatic Scala:

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> res25.map { row =>
  (row: @unchecked) match {
    case Row(a: Int, Row(b: Double, c: Double)) => (a, Location(b, c))
  }
}
res26: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]
scala> res26.collect()
res27: Array[(Int, Location)] = Array((10,Location(35.0,25.0)), (20,Location(45.0,35.0)))
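
The same extractor works if you only need the nested struct; a minimal sketch (assuming res25 is the original DataFrame; the match is intentionally partial, so expect a non-exhaustive-match warning):

scala> res25.map { case Row(_, Row(lat: Double, lon: Double)) => Location(lat, lon) }.show(1)
+----+----+
| lat| lon|
+----+----+
|35.0|25.0|
+----+----+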

I think the other answers nailed it, but perhaps they need some different wording.

In short, it's not possible to use case classes in DataFrames since they don't care about case classes and use RowEncoder to map internal SQL types to a Row.

As the other answers said, you have to turn a Row-based DataFrame into a Dataset using the as operator.

scala> val df = Seq((10, Location(35, 25)), (20, Location(45, 35))).toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<lat: double, lon: double>]

scala> val ds = df.as[(Int, Location)]
ds: org.apache.spark.sql.Dataset[(Int, Location)] = [_1: int, _2: struct<lat: double, lon: double>]

scala> ds.show
+---+-----------+
| _1|         _2|
+---+-----------+
| 10|[35.0,25.0]|
| 20|[45.0,35.0]|
+---+-----------+

scala> ds.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: struct (nullable = true)
 |    |-- lat: double (nullable = false)
 |    |-- lon: double (nullable = false)
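
The difference is easy to see at the REPL (a minimal sketch; the res numbers are illustrative): first on the DataFrame hands back a Row, while first on the Dataset gives the typed tuple.

scala> df.first
res10: org.apache.spark.sql.Row = [10,[35.0,25.0]]

scala> ds.first
res11: (Int, Location) = (10,Location(35.0,25.0))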

scala> ds.map[TAB pressed twice]

def map[U](func: org.apache.spark.api.java.function.MapFunction[(Int, Location),U],encoder: org.apache.spark.sql.Encoder[U]): org.apache.spark.sql.Dataset[U]
def map[U](func: ((Int, Location)) => U)(implicit evidence$6: org.apache.spark.sql.Encoder[U]): org.apache.spark.sql.Dataset[U]
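
So the typed overload resolves naturally against the tuple; for instance (a minimal sketch, with the default _1/_2 column names Spark gives encoded tuples):

scala> ds.map { case (id, loc) => (id, loc.lat + loc.lon) }.show()
+---+----+
| _1|  _2|
+---+----+
| 10|60.0|
| 20|80.0|
+---+----+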
