
All-pairs similarity using tfidf vectors in pyspark

I'm trying to find similar documents based on their text in Spark, using Python.

So far I have implemented RowMatrix, IndexedRowMatrix, and CoordinateMatrix to set this up, and then implemented columnSimilarities (DIMSUM). The problem with DIMSUM is that it's optimized for matrices with a lot of features and few items: http://stanford.edu/~rezab/papers/dimsum.pdf

Our initial approach was to create tf-idf vectors of all words in all documents, then transpose that into a RowMatrix where we have a row for each word and a column for each item. We then ran columnSimilarities, which gives us a CoordinateMatrix of ((item_i, item_j), similarity). This just doesn't work well when the number of columns is greater than the number of rows.
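For concreteness, here is a minimal sketch of that pipeline using the Spark 2.x MLlib RDD API. The sample documents, the feature count, and the DIMSUM sampling threshold are illustrative assumptions, not values from our actual setup:

```python
# Sketch of the tf-idf -> transpose -> columnSimilarities pipeline.
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

sc = SparkContext(appName="all-pairs-similarity")

docs = sc.parallelize([
    "spark is a distributed engine",
    "spark runs jobs on a cluster",
    "completely unrelated text here",
])

# tf-idf: one sparse vector per document
tf = HashingTF(numFeatures=10000).transform(docs.map(lambda d: d.split()))
tf.cache()  # IDF.fit and IDF.transform each traverse this RDD
tfidf = IDF().fit(tf).transform(tf)

# Build a CoordinateMatrix with documents as rows and words as columns,
# then transpose it so each row is a word and each column is a document.
entries = tfidf.zipWithIndex().flatMap(
    lambda vec_idx: [MatrixEntry(vec_idx[1], int(j), float(v))
                     for j, v in zip(vec_idx[0].indices, vec_idx[0].values)])
word_by_doc = CoordinateMatrix(entries).transpose().toRowMatrix()

# DIMSUM: cosine similarity between every pair of columns (documents);
# a threshold > 0 trades accuracy for speed via sampling.
sims = word_by_doc.columnSimilarities(threshold=0.1)
for e in sims.entries.take(10):
    print("docs (%d, %d): similarity %.3f" % (e.i, e.j, e.value))
```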

We need a way to calculate all-pairs similarity with a lot of items and few features (#items = 10^7, #features = 10^4). At a higher level, we're trying to build an item-based recommender that, given one item, returns a few quality recommendations based only on the text.

I'd write this as a comment instead of an answer, but SO won't let me comment yet.

This would be "trivially" solved by using Elasticsearch's more-like-this query. From the docs you can see how it works and which factors are taken into account, which should be useful information even if you end up implementing this in Python.
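As a hedged sketch of what that lookup could look like via the official elasticsearch-py client: the index name "items", the field name "text", the document id, and the tuning parameters below are all illustrative assumptions, not details from the question.

```python
# Sketch of a more_like_this lookup with elasticsearch-py (7.x-style API).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "size": 5,  # a few quality recommendations per item
    "query": {
        "more_like_this": {
            "fields": ["text"],
            # "like" can reference an already-indexed document by id
            "like": [{"_index": "items", "_id": "item-123"}],  # hypothetical id
            "min_term_freq": 1,
            "max_query_terms": 25,
        }
    },
}
resp = es.search(index="items", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```

The appeal for your scale is that more_like_this extracts the top tf-idf-weighted terms from the source document and runs them as a query at lookup time, so you never materialize all 10^7 x 10^7 pairs up front.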

They have also implemented other interesting algorithms, such as the significant terms aggregation.
