
Sum vector columns in Spark

I have a dataframe with multiple columns that contain vectors (the number of vector columns is dynamic). I need to create a new column that takes the sum of all the vector columns. I'm having a hard time getting this done. Here is the code to generate a sample dataset that I'm testing on:

import org.apache.spark.ml.feature.VectorAssembler

val temp1 = spark.createDataFrame(Seq(
                                    (1,1.0,0.0,4.7,6,0.0),
                                    (2,1.0,0.0,6.8,6,0.0),
                                    (3,1.0,1.0,7.8,5,0.0),
                                    (4,0.0,1.0,4.1,7,0.0),
                                    (5,1.0,0.0,2.8,6,1.0),
                                    (6,1.0,1.0,6.1,5,0.0),
                                    (7,0.0,1.0,4.9,7,1.0),
                                    (8,1.0,0.0,7.3,6,0.0)))
                                    .toDF("id", "f1","f2","f3","f4","label")

val assembler1 = new VectorAssembler()
    .setInputCols(Array("f1","f2","f3"))
    .setOutputCol("vec1")

val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)

val assembler2 = new VectorAssembler()
    .setInputCols(Array("f2","f3", "f4"))
    .setOutputCol("vec2")

val df = assembler2.setHandleInvalid("skip").transform(temp2)

This gives me the following dataset:

+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label|         vec1|         vec2|
+---+---+---+---+---+-----+-------------+-------------+
|  1|1.0|0.0|4.7|  6|  0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
|  2|1.0|0.0|6.8|  6|  0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
|  3|1.0|1.0|7.8|  5|  0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
|  4|0.0|1.0|4.1|  7|  0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
|  5|1.0|0.0|2.8|  6|  1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
|  6|1.0|1.0|6.1|  5|  0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
|  7|0.0|1.0|4.9|  7|  1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
|  8|1.0|0.0|7.3|  6|  0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+

If I needed to take the sum of regular columns, I could do it using something like:

import org.apache.spark.sql.functions.col

// namesOfColumnsToSum is a Seq of the (numeric) column names to add up
df.withColumn("sum", namesOfColumnsToSum.map(col).reduce((c1, c2) => c1 + c2))

I know I can use Breeze to sum DenseVectors using just the "+" operator:

import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1+v2
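
(As a note on mixing the two vector types: vec1 and vec2 are org.apache.spark.ml.linalg vectors, not Breeze vectors, so some conversion is needed before Breeze's "+" can be used on them. A rough sketch of converting back and forth, where toBreeze/fromBreeze are just hypothetical helper names, not Spark API:)

import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors}
import breeze.linalg.{DenseVector => BDV}

// Copy an ml Vector into a Breeze DenseVector and back
def toBreeze(v: MLVector): BDV[Double] = BDV(v.toArray)
def fromBreeze(bv: BDV[Double]): MLVector = Vectors.dense(bv.toArray)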

So the above code gives me the expected vector. But I'm not sure how to do the same with the vector columns and sum the vec1 and vec2 columns.

I did try the suggestions mentioned here, but had no luck.

Here's my take, but coded in PySpark. Someone can probably help translate this to Scala:

from pyspark.ml.linalg import Vectors, VectorUDT
import numpy as np
from pyspark.sql.functions import udf, array

# Element-wise sum of a list of vectors, returned as a dense ml vector
def vector_sum(arr):
    return Vectors.dense(np.sum(arr, axis=0))

vector_sum_udf = udf(vector_sum, VectorUDT())

# Collect the vector columns into an array column and sum them with the UDF
df = df.withColumn('sum', vector_sum_udf(array(['vec1', 'vec2'])))
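
For the Scala side, a minimal sketch of the same idea as a Scala UDF, assuming Spark's org.apache.spark.ml.linalg vectors and the vec1/vec2 column names from the question (untested, so treat it as a starting point rather than a verified answer):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{array, col, udf}

// Element-wise sum of a sequence of ml Vectors; assumes they all have the same length
val vectorSum = udf { (vs: Seq[Vector]) =>
  Vectors.dense(vs.map(_.toArray).reduce((a, b) => a.zip(b).map { case (x, y) => x + y }))
}

// The list of vector columns can be built dynamically, as in the question
val vectorCols = Seq("vec1", "vec2")
val withSum = df.withColumn("sum", vectorSum(array(vectorCols.map(col): _*)))

Since toArray works on both dense and sparse vectors, this should also handle SparseVector inputs, at the cost of densifying them.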
