Spark Scala Trouble Converting DataFrame To DataSet
I have a dataframe with the following schema
db.printSchema()
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- id: string (nullable = true)
|-- sparse_rep: struct (nullable = true)
| |-- 1: double (nullable = true)
| |-- 10: double (nullable = true)
| |-- 11: double (nullable = true)
| |-- 12: double (nullable = true)
| |-- 13: double (nullable = true)
| |-- 14: double (nullable = true)
| |-- 15: double (nullable = true)
| |-- 17: double (nullable = true)
| |-- 18: double (nullable = true)
| |-- 2: double (nullable = true)
| |-- 20: double (nullable = true)
| |-- 21: double (nullable = true)
| |-- 22: double (nullable = true)
| |-- 23: double (nullable = true)
| |-- 24: double (nullable = true)
| |-- 25: double (nullable = true)
| |-- 26: double (nullable = true)
| |-- 27: double (nullable = true)
| |-- 3: double (nullable = true)
| |-- 4: double (nullable = true)
| |-- 7: double (nullable = true)
| |-- 9: double (nullable = true)
|-- title: string (nullable = true)
Everything here looks straightforward except for sparse_rep. The sparse_rep object was originally created in Spark as a Map[Int,Double] and then written to MongoDB.
However, when I try to coerce it back to Map[Int,Double] using a Dataset:
case class blogRow(_id:String, id:Int, sparse_rep:Map[Int,Double],title:String)
val blogRowEncoder = Encoders.product[blogRow]
db.as[blogRow](blogRowEncoder)
I get the following error.
Caused by: org.apache.spark.sql.AnalysisException: need a map field but got struct<1:double,10:double,11:double,12:double,13:double,14:double,15:double,17:double,18:double,2:double,20:double,21:double,22:double,23:double,24:double,25:double,26:double,27:double,3:double,4:double,7:double,9:double>;
Convert the struct type to a map type, and then use the case class. The schema of the DataFrame and the fields of the case class should match.
Check the code below.
scala> case class blogRow(_id:String, id:Int, sparse_rep:Map[Int,Double],title:String)
defined class blogRow
scala> val blogRowDF = df
.withColumn("sparse_rep",map(
df
.select("sparse_rep.*")
.columns
.flatMap(c => List(lit(c).cast("int"),col(s"sparse_rep.${c}"))):_*)
)
.withColumn("_id",$"_id.oid")
.withColumn("id",$"id".cast("int"))
.as[blogRow]
scala> blogRowDF.show(false)
+---------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|_id |id |sparse_rep |title |
+---------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
|oid_value|null|Map(10 -> 10.0, 24 -> 24.0, 25 -> 25.0, 14 -> 14.0, 20 -> 20.0, 1 -> 1.0, 21 -> 21.0, 9 -> 9.0, 13 -> 13.0, 2 -> 2.0, 17 -> 17.0, 22 -> 22.0, 27 -> 27.0, 12 -> 12.0, 7 -> 7.0, 3 -> 3.0, 18 -> 18.0, 11 -> 11.0, 26 -> 26.0, 23 -> 23.0, 4 -> 4.0, 15 -> 15.0)|title_value|
+---------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+
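To make the construction above clearer: `map()` takes alternating key/value `Column` arguments, so flattening each struct field name into a literal key column followed by its value column builds the map in one call. A minimal sketch of just the column-building step, using a hypothetical subset of the sparse_rep field names:

```scala
import org.apache.spark.sql.functions.{col, lit, map}

// Assume these are the field names returned by df.select("sparse_rep.*").columns.
val fieldNames = Seq("1", "10", "11")

// For each field name c, emit a key column lit(c).cast("int")
// followed by its value column col(s"sparse_rep.$c").
val kvColumns = fieldNames.flatMap(c =>
  List(lit(c).cast("int"), col(s"sparse_rep.$c")))

// map(kvColumns: _*) is then equivalent to
// map(lit("1").cast("int"), $"sparse_rep.1", lit("10").cast("int"), $"sparse_rep.10", ...)
val sparseRepAsMap = map(kvColumns: _*)
```

Casting the literal key to `"int"` is what lets the resulting column match the `Map[Int,Double]` field in the case class.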
Another option:
df.printSchema()
/**
* root
* |-- _id: struct (nullable = true)
* | |-- oid: string (nullable = true)
* |-- id: string (nullable = true)
* |-- sparse_rep: struct (nullable = true)
* | |-- 1: double (nullable = true)
* | |-- 10: double (nullable = true)
* | |-- 11: double (nullable = true)
* | |-- 12: double (nullable = true)
* | |-- 13: double (nullable = true)
* | |-- 14: double (nullable = true)
* | |-- 15: double (nullable = true)
* | |-- 17: double (nullable = true)
* | |-- 18: double (nullable = true)
* | |-- 2: double (nullable = true)
* | |-- 20: double (nullable = true)
* | |-- 21: double (nullable = true)
* | |-- 22: double (nullable = true)
* | |-- 23: double (nullable = true)
* | |-- 24: double (nullable = true)
* | |-- 25: double (nullable = true)
* | |-- 26: double (nullable = true)
* | |-- 27: double (nullable = true)
* | |-- 3: double (nullable = true)
* | |-- 4: double (nullable = true)
* | |-- 7: double (nullable = true)
* | |-- 9: double (nullable = true)
* |-- title: string (nullable = true)
*/
Dataset[Row] -> Dataset[BlogRow]
val ds =
df.withColumn("sparse_rep", expr("from_json(to_json(sparse_rep), 'map<int, double>')"))
.withColumn("_id",$"_id.oid")
.withColumn("id",$"id".cast("int"))
.as[BlogRow]
ds.printSchema()
/**
* root
* |-- _id: string (nullable = true)
* |-- id: integer (nullable = true)
* |-- sparse_rep: map (nullable = true)
* | |-- key: integer
* | |-- value: double (valueContainsNull = true)
* |-- title: string (nullable = true)
*/
where the case class is as follows:
case class BlogRow(_id:String, id:Int, sparse_rep:Map[Int,Double],title:String)
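The `from_json(to_json(...))` trick serializes the struct to a JSON string and parses it back with a map schema, so the field names become keys. The same round trip can be written with the typed DataFrame API instead of a SQL `expr` string; this is a sketch assuming the same `df` and `spark.implicits._` are in scope:

```scala
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{DoubleType, IntegerType, MapType}

val ds2 = df
  // Struct -> JSON string -> map<int, double>, keyed by the struct field names.
  .withColumn("sparse_rep",
    from_json(to_json(col("sparse_rep")), MapType(IntegerType, DoubleType)))
  .withColumn("_id", col("_id.oid"))
  .withColumn("id", col("id").cast("int"))
  .as[BlogRow]
```

Both spellings produce the same plan; using `DataType` objects just moves schema mistakes from runtime string parsing to compile-visible code.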