How to create a TypedColumn in a Spark Dataset and manipulate it?
I'm trying to perform an aggregation using `mapGroups` that returns a `SparseMatrix` as one of the columns, and then sum that column. I created a `case class` schema for the mapped rows in order to provide column names. The matrix column is typed `org.apache.spark.mllib.linalg.Matrix`. If I don't run `toDF` before performing the aggregation (`select(sum("mycolumn"))`), I get one type mismatch error (`required: org.apache.spark.sql.TypedColumn[MySchema,?]`). If I include `toDF`, I get another type mismatch error: `cannot resolve 'sum(mycolumn)' due to data type mismatch: function sum requires numeric types, not org.apache.spark.mllib.linalg.MatrixUDT`. So what's the right way to do it?
It looks like you are struggling with at least two distinct problems here. Let's assume you have a `Dataset` like this:
import org.apache.spark.mllib.linalg.Matrices
import spark.implicits._

val ds = Seq(
  ("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))),
  ("foo", Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)))
).toDS
Selecting a `TypedColumn`:
using implicit conversions with `$`:
ds.select($"_1".as[String])
using `org.apache.spark.sql.functions.col`:
ds.select(col("_1").as[String])
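The `.as[...]` call is what converts a plain `Column` into the `TypedColumn` that `Dataset.select` expects, which is what the first error message was complaining about. For a numeric field the same pattern gives a typed sum that compiles; a minimal sketch, assuming a hypothetical `Dataset[(String, Double)]` named `nums` and `spark.implicits._` in scope:

```scala
import org.apache.spark.sql.functions.{col, sum}

val nums = Seq(("foo", 1.0), ("foo", 2.0)).toDS
// .as[Double] turns the untyped sum(...) Column into a TypedColumn[_, Double]
val total = nums.select(sum(col("_2")).as[Double])  // a Dataset[Double]
```

This only works because `Double` is a numeric type with an encoder; it is exactly what `MatrixUDT` columns lack.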
Adding matrices:
`Matrix` and `MatrixUDT` don't implement addition. That means you won't be able to use the `sum` function or reduce with `+`.
If you really want to do it with `Datasets` you can try something like this:
ds.groupByKey(_._1).mapGroups(
  (key, values) => {
    val matrices = values.map(_._2.toArray)
    val first = matrices.next
    val sum = matrices.foldLeft(first)(
      (acc, m) => acc.zip(m).map { case (x, y) => x + y }
    )
    (key, sum)
  })
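The summation inside `mapGroups` is plain Scala collection code: each matrix is flattened to a column-major `Array[Double]` with `toArray` (which densifies a `SparseMatrix`), and arrays are combined element-wise with `zip`/`map`. A Spark-free illustration of that step, assuming all matrices in a group share the same dimensions (3 x 2 here):

```scala
// Column-major arrays standing in for two 3 x 2 matrices
val m1 = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)
val m2 = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)

val matrices = Iterator(m1, m2)
val first = matrices.next()           // seed the fold with the first matrix
val summed = matrices.foldLeft(first) { (acc, m) =>
  acc.zip(m).map { case (x, y) => x + y }  // element-wise addition
}
// summed is Array(2.0, 6.0, 10.0, 4.0, 8.0, 12.0)
```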
and map back to matrices, but personally I would just convert to an RDD and use `breeze`.
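The RDD-plus-breeze route might look like the following sketch (not verified here: it assumes breeze is on the classpath, a live `SparkSession`, and that every matrix under a key is dense with identical dimensions):

```scala
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Matrices

val summed = ds.rdd
  // wrap each mllib matrix's column-major data in a breeze DenseMatrix
  .map { case (key, m) => (key, new BDM(m.numRows, m.numCols, m.toArray)) }
  .reduceByKey(_ + _)  // breeze matrices do implement +
  // convert back to mllib; breeze data is also column-major
  .mapValues(b => Matrices.dense(b.rows, b.cols, b.data))
```

The point of the detour is that breeze's `DenseMatrix` supports `+` directly, so the per-key reduction is a one-liner instead of hand-rolled array zipping.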