Sum vector columns in Spark
I have a dataframe with multiple columns that contain vectors (the number of vector columns is dynamic). I need to create a new column holding the sum of all the vector columns, and I'm having a hard time getting this done. Here is the code to generate the sample dataset I'm testing on.
import org.apache.spark.ml.feature.VectorAssembler
val temp1 = spark.createDataFrame(Seq(
  (1,1.0,0.0,4.7,6,0.0),
  (2,1.0,0.0,6.8,6,0.0),
  (3,1.0,1.0,7.8,5,0.0),
  (4,0.0,1.0,4.1,7,0.0),
  (5,1.0,0.0,2.8,6,1.0),
  (6,1.0,1.0,6.1,5,0.0),
  (7,0.0,1.0,4.9,7,1.0),
  (8,1.0,0.0,7.3,6,0.0)))
  .toDF("id", "f1","f2","f3","f4","label")
val assembler1 = new VectorAssembler()
  .setInputCols(Array("f1","f2","f3"))
  .setOutputCol("vec1")
val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)
val assembler2 = new VectorAssembler()
  .setInputCols(Array("f2","f3", "f4"))
  .setOutputCol("vec2")
val df = assembler2.setHandleInvalid("skip").transform(temp2)
This gives me the following dataset:
+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label|         vec1|         vec2|
+---+---+---+---+---+-----+-------------+-------------+
|  1|1.0|0.0|4.7|  6|  0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
|  2|1.0|0.0|6.8|  6|  0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
|  3|1.0|1.0|7.8|  5|  0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
|  4|0.0|1.0|4.1|  7|  0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
|  5|1.0|0.0|2.8|  6|  1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
|  6|1.0|1.0|6.1|  5|  0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
|  7|0.0|1.0|4.9|  7|  1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
|  8|1.0|0.0|7.3|  6|  0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+
If I needed to take the sum of regular columns, I could do it with something like:
import org.apache.spark.sql.functions.col
df.withColumn("sum", namesOfColumnsToSum.map(col).reduce((c1, c2)=>c1+c2))
I know I can use breeze to sum DenseVectors just by using the "+" operator:
import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1+v2
So, the above code gives me the expected vector. But I'm not sure how to take the sum of the vector columns, i.e. how to sum the vec1 and vec2 columns.
I did try the suggestions mentioned here, but had no luck.
Here's my take, but coded in PySpark. Someone can probably help in translating this to Scala:
from pyspark.ml.linalg import Vectors, VectorUDT
import numpy as np
from pyspark.sql.functions import udf, array
def vector_sum(arr):
    # arr is the list of vectors for one row; sum them element-wise
    return Vectors.dense(np.sum(arr, axis=0))
vector_sum_udf = udf(vector_sum, VectorUDT())
df = df.withColumn('sum',vector_sum_udf(array(['vec1','vec2'])))
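For a Scala version, one possible approach is sketched below. It is not a line-by-line translation of the PySpark snippet: instead of packing the columns into an array, it defines a two-argument UDF that adds a pair of vectors element-wise and then folds it over the column list with reduce, mirroring the pattern used for regular columns in the question. It assumes the df from the question, that every vector column has the same length and contains no nulls, and the names addVectors and vectorCols are made up for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Element-wise sum of two ML vectors.
// Assumption: both vectors have the same length (zip would silently truncate otherwise).
val addVectors = udf { (a: Vector, b: Vector) =>
  Vectors.dense(a.toArray.zip(b.toArray).map { case (x, y) => x + y })
}

// Hypothetical list of vector column names; it can be built dynamically,
// e.g. by filtering df.columns.
val vectorCols = Seq("vec1", "vec2")

// Fold the pairwise UDF over all vector columns, as with regular columns.
val summed = df.withColumn(
  "sum",
  vectorCols.map(col).reduce((c1, c2) => addVectors(c1, c2))
)

summed.select("vec1", "vec2", "sum").show(false)
```

Note that toArray densifies sparse vectors, so for very wide sparse vectors a different strategy may be preferable.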