
How to create a TypedColumn in a Spark Dataset and manipulate it?

I'm trying to perform an aggregation using mapGroups that returns a SparseMatrix as one of the columns, and sum the columns.

I created a case class schema for the mapped rows in order to provide column names. The matrix column is typed org.apache.spark.mllib.linalg.Matrix. If I don't run toDF before performing the aggregation (select(sum("mycolumn"))) I get one type mismatch error (required: org.apache.spark.sql.TypedColumn[MySchema,?]). If I include toDF I get another type mismatch error: cannot resolve 'sum(mycolumn)' due to data type mismatch: function sum requires numeric types, not org.apache.spark.mllib.linalg.MatrixUDT. So what's the right way to do it?
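Roughly the setup described, reconstructed as a sketch (the grouped Dataset and surrounding code are only illustrative; only MySchema and mycolumn appear in the errors above):

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.sum

// illustrative schema -- the real case class isn't shown here
case class MySchema(id: String, mycolumn: Matrix)

val grouped: Dataset[MySchema] = ??? // result of the mapGroups aggregation

// type mismatch: Dataset.select expects a TypedColumn[MySchema, ?]
// grouped.select(sum("mycolumn"))

// analysis error: sum requires numeric types, not MatrixUDT
// grouped.toDF.select(sum("mycolumn"))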

It looks like you're struggling with at least two distinct problems here. Let's assume you have a Dataset like this:

import org.apache.spark.mllib.linalg.Matrices
import sqlContext.implicits._  // spark.implicits._ on Spark 2.x; needed for toDS

val ds = Seq(
  ("foo",  Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))), 
  ("foo",  Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)))
).toDS

Selecting a TypedColumn (a combined sketch with the required imports follows this list):

  • using implicit conversions with $ :

     ds.select($"_1".as[String]) 
  • using org.apache.spark.sql.functions.col :

     ds.select(col("_1").as[String]) 
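A minimal sketch of both forms with the implicits they rely on (assuming a context named sqlContext on Spark 1.6, or spark on 2.x):

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col
import sqlContext.implicits._  // provides $ and the String encoder

// both forms yield a TypedColumn, so select stays typed
val a: Dataset[String] = ds.select($"_1".as[String])
val b: Dataset[String] = ds.select(col("_1").as[String])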

Adding matrices:

  • MLlib Matrix and MatrixUDT don't implement addition. It means you won't be able to use the sum function or reduce with +.
  • you can use a third-party linear algebra library, but this is not supported in Spark SQL / Spark Dataset

If you really want to do it with Datasets you can try to do something like this:

ds.groupByKey(_._1).mapGroups(
  (key, values) => {
    // work on the column-major arrays backing each matrix
    val matrices = values.map(_._2.toArray)
    val first = matrices.next
    // element-wise sum of the remaining arrays onto the first
    val sum = matrices.foldLeft(first)(
      (acc, m) => acc.zip(m).map { case (x, y) => x + y }
    )
    (key, sum)
  }
)

and map back to matrices, but personally I would just convert to an RDD and use breeze.
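For instance, a rough sketch of that RDD + Breeze route (the toBreeze/fromBreeze helpers below are illustrative, not an MLlib API; MLlib's own breeze converters are private):

import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// illustrative helpers: MLlib and Breeze both store dense values column-major
def toBreeze(m: Matrix): BDM[Double] = new BDM(m.numRows, m.numCols, m.toArray)
def fromBreeze(m: BDM[Double]): Matrix = Matrices.dense(m.rows, m.cols, m.data)

val summed = ds.rdd
  .map { case (key, m) => (key, toBreeze(m)) }
  .reduceByKey(_ + _)          // Breeze DenseMatrix supports element-wise +
  .mapValues(fromBreeze)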
