
PySpark - SparseVector Column to Matrix

I am very new to using PySpark. I have a column of SparseVectors in my PySpark dataframe.

rescaledData.select('features').show(5,False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(262144,[43953,62425,66522,148962,174441,249180],[3.9219733362813143,3.9219733362813143,1.213923135179104,3.9219733362813143,3.9219733362813143,0.5720692490067093])|
|(262144,[57925,66522,90939,249180],[3.5165082281731497,1.213923135179104,3.9219733362813143,0.5720692490067093])                                                    |
|(262144,[23366,45531,73408,211290],[2.6692103677859462,3.005682604407159,3.5165082281731497,3.228826155721369])                                                     |
|(262144,[30913,81939,99546,137643,162885,249180],[3.228826155721369,3.9219733362813143,3.005682604407159,3.005682604407159,3.228826155721369,1.1441384980134186])   |
|(262144,[108134,152329,249180],[3.9219733362813143,2.6692103677859462,2.8603462450335466])                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I need to convert the above dataframe into a matrix where every row of the matrix corresponds to the SparseVector in the same row of the dataframe.

For example,

+-----------------+
|features         |
+-----------------+
|(7,[1,2],[45,63])|
|(7,[3,5],[85,69])|
|(7,[1,2],[89,56])|
+-----------------+

must be converted to:

[[0, 45, 63, 0, 0, 0, 0],
 [0, 0, 0, 85, 0, 69, 0],
 [0, 89, 56, 0, 0, 0, 0]]

I have read the link below, which shows that there is a function toArray() that does exactly what I want: https://mingchen0919.github.io/learning-apache-spark/pyspark-vectors.html

However, I am having trouble using it.

from pyspark.sql.functions import udf

vector_udf = udf(lambda vector: vector.toArray())
rescaledData.withColumn('features_', vector_udf(rescaledData.features)).first()

I need it to convert every row into an array and then convert the PySpark dataframe into a matrix.

Convert to RDD and map:

vectors = df.select("features").rdd.map(lambda row: row.features)

Convert the result to a distributed matrix:

from pyspark.mllib.linalg.distributed import RowMatrix

matrix = RowMatrix(vectors)
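A RowMatrix stays distributed, so it can be inspected without collecting anything to the driver. A minimal sketch, assuming the matrix built above:

# dimensions are computed across the cluster
print(matrix.numRows(), matrix.numCols())

# matrix.rows is an RDD holding the original vectors
matrix.rows.take(1)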

If you want DenseVectors (mind the memory requirements!):

vectors = df.select("features").rdd.map(lambda row: row.features.toArray())
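If everything fits in driver memory, the dense rows can then be collected and stacked into a local NumPy matrix, which is exactly the layout asked for in the question. A minimal sketch, reusing the vectors RDD from the line above:

import numpy as np

# each RDD element is a 1-D numpy array (one dense row);
# collecting and stacking them yields the full local matrix
local_matrix = np.array(vectors.collect())
print(local_matrix.shape)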

toArray() returns a NumPy array. We can convert it to a list and then collect the dataframe.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))

df.show() ## my sample dataframe
+-------------------+
|           features|
+-------------------+
|(4,[1,3],[3.0,4.0])|
|(4,[1,3],[3.0,4.0])|
|(4,[1,3],[3.0,4.0])|
+-------------------+

colvalues = df.select(vector_udf('features').alias('features')).collect()

list(map(lambda x: x.features, colvalues))
[[0.0, 3.0, 0.0, 4.0], [0.0, 3.0, 0.0, 4.0], [0.0, 3.0, 0.0, 4.0]]
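If a NumPy array is preferred over plain Python lists, the collected rows can be stacked the same way (assuming numpy is available on the driver):

import numpy as np

np.array(list(map(lambda x: x.features, colvalues)))
# array([[0., 3., 0., 4.],
#        [0., 3., 0., 4.],
#        [0., 3., 0., 4.]])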
