
Size of the sparse vector in the column of a data-frame in Apache Scala Spark

I am using a VectorAssembler to transform a dataframe:

var stringAssembler = new VectorAssembler().setInputCols(encodedstringColumns).setOutputCol("stringFeatures")
df = stringAssembler.transform(df)
// Problem line: this returns the size of the Row (1), not the size of the vector
var stringVectorSize = df.select("stringFeatures").head.size
var stringPca = new PCA().setInputCol("stringFeatures").setOutputCol("pcaStringFeatures").setK(stringVectorSize).fit(df)

Now stringVectorSize will tell PCA how many columns to keep when performing PCA. I am trying to get the size of the output sparse vector from the VectorAssembler, but my code gives size = 1, which is wrong. What is the right way to get the size of a sparse vector that is part of a dataframe column?

To put it plainly:

+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|PetalLengthCm|PetalWidthCm|SepalLengthCm|SepalWidthCm| Id|    Species|Species_Encoded|       Id_Encoded|      stringFeatures|
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|          1.4|         0.2|          5.1|         3.5|  1|Iris-setosa|  (2,[0],[1.0])| (149,[91],[1.0])|(151,[91,149],[1....|
|          1.4|         0.2|          4.9|         3.0|  2|Iris-setosa|  (2,[0],[1.0])|(149,[119],[1.0])|(151,[119,149],[1...|
|          1.3|         0.2|          4.7|         3.2|  3|Iris-setosa|  (2,[0],[1.0])|(149,[140],[1.0])|(151,[140,149],[1...|
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+

For the above dataframe, I want to extract the size of the stringFeatures sparse vector (which is 151).

If you read the DataFrame documentation, you will notice that the head method returns a Row. Therefore, rather than obtaining your SparseVector's size, you are obtaining the Row's size. To solve this, you have to extract the element stored in the Row:

val row = df.select("stringFeatures").head
val vector = row(0).asInstanceOf[SparseVector]
val size = vector.size
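
For completeness, a minimal sketch of wiring the extracted size back into the PCA call from the question (assuming the Spark 1.x mllib vector type used in this answer, and that df is the assembled dataframe):

import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.SparseVector

// Read the vector length from the first row, then hand it to PCA's k parameter
val row = df.select("stringFeatures").head
val stringVectorSize = row(0).asInstanceOf[SparseVector].size  // 151 for the data above

val stringPca = new PCA()
  .setInputCol("stringFeatures")
  .setOutputCol("pcaStringFeatures")
  .setK(stringVectorSize)
  .fit(df)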

For instance:

import sqlContext.implicits._
import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.SparseVector

val df = sc.parallelize(Array(10,2,3,4)).toDF("n")
// Builds a sparse vector of length i with a single non-zero entry at the last index
val pepe = udf((i: Int) => new SparseVector(i, Array(i - 1), Array(i)))
val x = df.select(pepe(df("n")).as("n"))

x.show()

+---------------+
|              n|
+---------------+
|(10,[9],[10.0])|
|  (2,[1],[2.0])|
|  (3,[2],[3.0])|
|  (4,[3],[4.0])|
+---------------+

val y = x.select("n").head

y(0).asInstanceOf[SparseVector].size
res12: Int = 10
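
As a side note, when the vector column was produced by a stage such as VectorAssembler, the vector length is usually also recorded in the column's ML attribute metadata, so it can be read without touching any rows. A sketch of this alternative (assuming the metadata was actually written; the size comes back as -1 when the length is unknown):

import org.apache.spark.ml.attribute.AttributeGroup

// Reads the vector length from the column's metadata instead of from a row
val group = AttributeGroup.fromStructField(df.schema("stringFeatures"))
val stringVectorSize = group.size  // -1 if the metadata carries no length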
