I computed the TF-IDF for 3 sample text documents using PySpark's HashingTF and IDF, and got the following SparseVector results:
(1048576,[558379],[1.43841036226])
(1048576,[181911,558379,959994], [0.287682072452,0.287682072452,0.287682072452])
(1048576,[181911,959994],[0.287682072452,0.287682072452])
How can I calculate the sum of the TF-IDF values for all terms within a document, e.g. (0.287682072452 + 0.287682072452) for the 3rd document?
The output from IDF is just a PySpark SparseVector; when it is exposed to Python, its values attribute is a standard NumPy array, so all you need is a sum call:
from pyspark.mllib.linalg import SparseVector
v = SparseVector(1048576,[181911,959994],[0.287682072452,0.287682072452])
v.values.sum()
## 0.57536414490400001
or over RDD:
rdd = sc.parallelize([
    SparseVector(1048576, [558379], [1.43841036226]),
    SparseVector(1048576, [181911, 558379, 959994],
                 [0.287682072452, 0.287682072452, 0.287682072452]),
    SparseVector(1048576, [181911, 959994],
                 [0.287682072452, 0.287682072452])])

# map() is lazy; collect() brings the per-document sums back to the driver
rdd.map(lambda v: v.values.sum()).collect()
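If you want to sanity-check the numbers without a running Spark context, the same per-document sums can be reproduced in plain NumPy, since SparseVector.values is just an ndarray holding the stored (non-zero) entries. This sketch uses the value arrays from the three sample vectors above; the indices are omitted because they don't affect the sum:

```python
import numpy as np

# Stored TF-IDF values of the three sample SparseVectors above
docs = [
    np.array([1.43841036226]),
    np.array([0.287682072452, 0.287682072452, 0.287682072452]),
    np.array([0.287682072452, 0.287682072452]),
]

# Per-document sums, mirroring rdd.map(lambda v: v.values.sum())
sums = [v.sum() for v in docs]
print(sums)
```

The third entry matches the v.values.sum() result shown earlier.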