
How to create a TypedColumn in a Spark Dataset and manipulate it?

I'm trying to perform an aggregation using mapGroups that returns a SparseMatrix as one of the columns, and sum the columns.

I created a case class schema for the mapped rows in order to provide column names. The matrix column is typed org.apache.spark.mllib.linalg.Matrix. If I don't run toDF before performing the aggregation (select(sum("mycolumn"))) I get one type mismatch error (required: org.apache.spark.sql.TypedColumn[MySchema,?]). If I include toDF I get another type mismatch error: cannot resolve 'sum(mycolumn)' due to data type mismatch: function sum requires numeric types, not org.apache.spark.mllib.linalg.MatrixUDT. So what's the right way to do it?
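Roughly the setup described, reconstructed as a sketch (the grouped Dataset and surrounding code are only illustrative; only MySchema and mycolumn appear in the errors above):

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.sum

// illustrative schema -- the real case class isn't shown here
case class MySchema(id: String, mycolumn: Matrix)

val grouped: Dataset[MySchema] = ??? // result of the mapGroups aggregation

// type mismatch: Dataset.select expects a TypedColumn[MySchema, ?]
// grouped.select(sum("mycolumn"))

// analysis error: sum requires numeric types, not MatrixUDT
// grouped.toDF.select(sum("mycolumn"))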

It looks like you're struggling with at least two distinct problems here. Let's assume you have a Dataset like this:

import org.apache.spark.mllib.linalg.Matrices
import sqlContext.implicits._  // spark.implicits._ on Spark 2.x; needed for toDS

val ds = Seq(
  ("foo",  Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))), 
  ("foo",  Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)))
).toDS

Selecting a TypedColumn (a combined sketch with the required imports follows this list):

  • using implicit conversions with $ :

     ds.select($"_1".as[String]) 
  • using org.apache.spark.sql.functions.col :

     ds.select(col("_1").as[String]) 
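A minimal sketch of both forms with the implicits they rely on (assuming a context named sqlContext on Spark 1.6, or spark on 2.x):

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col
import sqlContext.implicits._  // provides $ and the String encoder

// both forms yield a TypedColumn, so select stays typed
val a: Dataset[String] = ds.select($"_1".as[String])
val b: Dataset[String] = ds.select(col("_1").as[String])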

Adding matrices:

  • MLlib Matrix and MatrixUDT don't implement addition. It means you won't be able to use the sum function or reduce with +.
  • you can use a third-party linear algebra library, but this is not supported in Spark SQL / Spark Dataset

If you really want to do it with Datasets you can try to do something like this:

ds.groupByKey(_._1).mapGroups(
  (key, values) => {
    // work on the column-major arrays backing each matrix
    val matrices = values.map(_._2.toArray)
    val first = matrices.next
    // element-wise sum of the remaining arrays onto the first
    val sum = matrices.foldLeft(first)(
      (acc, m) => acc.zip(m).map { case (x, y) => x + y }
    )
    (key, sum)
  }
)

and map back to matrices, but personally I would just convert to an RDD and use breeze.
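For instance, a rough sketch of that RDD + Breeze route (the toBreeze/fromBreeze helpers below are illustrative, not an MLlib API; MLlib's own breeze converters are private):

import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// illustrative helpers: MLlib and Breeze both store dense values column-major
def toBreeze(m: Matrix): BDM[Double] = new BDM(m.numRows, m.numCols, m.toArray)
def fromBreeze(m: BDM[Double]): Matrix = Matrices.dense(m.rows, m.cols, m.data)

val summed = ds.rdd
  .map { case (key, m) => (key, toBreeze(m)) }
  .reduceByKey(_ + _)          // Breeze DenseMatrix supports element-wise +
  .mapValues(fromBreeze)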
