
How to apply SVD on TF-IDF Dataframe in pyspark

I applied PySpark's TF-IDF functions and got back the following result.

| features |
|----------|
| (35,[7,9,11,12,19,26,33],[1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003,1.6094379124341003,1.6094379124341003,1.6094379124341003])  |
| (35,[0,2,4,5,6,11,22],[0.9162907318741551,0.9162907318741551,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003]) |

So the DataFrame has a single column (features), and each row holds a SparseVector.
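To make the printed notation concrete: each cell reads as (size, [indices], [values]), so the first row is a 35-element vector with seven non-zero entries. The helper below is a hypothetical, plain-Python illustration of that layout (the name `sparse_to_dense` is made up and is not part of pyspark).

```python
# Hypothetical helper (not a pyspark function): expand the printed
# (size, indices, values) triple into a plain dense list.
def sparse_to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# First row of the table above.
row = sparse_to_dense(
    35,
    [7, 9, 11, 12, 19, 26, 33],
    [1.2039728043259361] * 3 + [1.6094379124341003] * 4,
)

print(len(row))  # 35
print(row[7])    # 1.2039728043259361
print(row[0])    # 0.0
```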

Now I want to build an IndexedRowMatrix from this DataFrame so that I can run the computeSVD function described here: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=svd#pyspark.mllib.linalg.distributed.IndexedRowMatrix.computeSVD

I tried the following, but it didn't work:

from pyspark.mllib.linalg.distributed import RowMatrix
mat = RowMatrix(tfidfData.rdd.map(lambda x: x.features))

TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

I used RowMatrix because constructing it doesn't require (index, vector) tuples, but I can't even build a RowMatrix. An IndexedRowMatrix will be harder still.

So how do I build an IndexedRowMatrix from the output of the TF-IDF DataFrame in PySpark?

I was able to solve it. As the error suggests, RowMatrix won't accept a pyspark.ml.linalg.SparseVector, so I converted each vector to its pyspark.mllib.linalg counterpart. Pay attention to the difference between ml and mllib. The following snippet converts the TF-IDF output to a RowMatrix, on which you can then call computeSVD.

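from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
# Use the name of your vector column here (the question's DataFrame calls it features).
mat = RowMatrix(df.rdd.map(lambda v: Vectors.dense(v.rawFeatures.toArray())))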

I converted to dense vectors here, but with a few extra lines of code you can convert ml.linalg.SparseVector to mllib.linalg.SparseVector instead.

Please excuse me for not commenting on the original answer; I don't have the required reputation points yet. To speed things up, it is better to create an mllib.linalg.SparseVector. It's really straightforward with the following modification:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
mat = RowMatrix(df.rdd.map(lambda v: Vectors.fromML(v.rawFeatures)))
