
Summation of TFIDF sparse vector values for each document in Spark with Python

I calculated the TF-IDF for 3 sample text documents using PySpark's HashingTF and IDF, and I got the following SparseVector results:

(1048576,[558379],[1.43841036226])
(1048576,[181911,558379,959994],  [0.287682072452,0.287682072452,0.287682072452])
(1048576,[181911,959994],[0.287682072452,0.287682072452])

How can I calculate the sum of the TF-IDF values for all terms within a document, e.g. (0.287682072452 + 0.287682072452) for the 3rd document?

The output from IDF is just a PySpark SparseVector when it is exposed to Python, and its values field is a standard NumPy array, so all you need is a sum call:

from pyspark.mllib.linalg import SparseVector

v = SparseVector(1048576,[181911,959994],[0.287682072452,0.287682072452])
v.values.sum()
## 0.57536414490400001

or over an RDD:

rdd = sc.parallelize([
  SparseVector(1048576,[558379],[1.43841036226]),
  SparseVector(1048576, [181911,558379,959994],  
      [0.287682072452,0.287682072452,0.287682072452]),
  SparseVector(1048576,[181911,959994],[0.287682072452,0.287682072452])])

# map is lazy; collect() materialises the per-document sums
rdd.map(lambda v: v.values.sum()).collect()
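If you want to sanity-check the arithmetic without a Spark session, the same per-document sum can be reproduced with plain NumPy, since a SparseVector just stores its non-zero entries as a NumPy array (a minimal sketch, no Spark required; the values here are copied from the 3rd document above):

```python
import numpy as np

# A SparseVector is (size, indices, values); the TF-IDF total for a
# document is simply the sum of its stored non-zero values.
values = np.array([0.287682072452, 0.287682072452])
total = values.sum()
print(total)  # ~0.575364144904
```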
