
All-pairs similarity using tfidf vectors in pyspark

I'm trying to find similar documents based on their text in Spark, using Python.

So far I have implemented RowMatrix, IndexedRowMatrix, and CoordinateMatrix to set this up, and then implemented columnSimilarities (DIMSUM). The problem with DIMSUM is that it's optimized for matrices with a lot of features and few items: http://stanford.edu/~rezab/papers/dimsum.pdf

Our initial approach was to create tf-idf vectors of all words in all documents, then transpose that into a RowMatrix where we have a row for each word and a column for each item. We then ran columnSimilarities, which gives us a CoordinateMatrix of ((item_i, item_j), similarity). This just doesn't work well when the number of columns is greater than the number of rows.
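For concreteness, here is a minimal sketch of that pipeline using the Spark 2.x MLlib RDD API. The sample documents, the feature count, and the DIMSUM sampling threshold are illustrative assumptions, not values from our actual setup:

```python
# Sketch of the tf-idf -> transpose -> columnSimilarities pipeline.
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

sc = SparkContext(appName="all-pairs-similarity")

docs = sc.parallelize([
    "spark is a distributed engine",
    "spark runs jobs on a cluster",
    "completely unrelated text here",
])

# tf-idf: one sparse vector per document
tf = HashingTF(numFeatures=10000).transform(docs.map(lambda d: d.split()))
tf.cache()  # IDF.fit and IDF.transform each traverse this RDD
tfidf = IDF().fit(tf).transform(tf)

# Build a CoordinateMatrix with documents as rows and words as columns,
# then transpose it so each row is a word and each column is a document.
entries = tfidf.zipWithIndex().flatMap(
    lambda vec_idx: [MatrixEntry(vec_idx[1], int(j), float(v))
                     for j, v in zip(vec_idx[0].indices, vec_idx[0].values)])
word_by_doc = CoordinateMatrix(entries).transpose().toRowMatrix()

# DIMSUM: cosine similarity between every pair of columns (documents);
# a threshold > 0 trades accuracy for speed via sampling.
sims = word_by_doc.columnSimilarities(threshold=0.1)
for e in sims.entries.take(10):
    print("docs (%d, %d): similarity %.3f" % (e.i, e.j, e.value))
```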

We need a way to calculate all-pairs similarity with a lot of items and few features (#items = 10^7, #features = 10^4). At a higher level, we're trying to build an item-based recommender that, given one item, returns a few quality recommendations based only on the text.

I'd write this as a comment instead of an answer, but SO won't let me comment yet.

This would be "trivially" solved by using Elasticsearch's more-like-this query. From the docs you can see how it works and which factors are taken into account, which should be useful information even if you end up implementing this in Python.
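As a hedged sketch of what that lookup could look like via the official elasticsearch-py client: the index name "items", the field name "text", the document id, and the tuning parameters below are all illustrative assumptions, not details from the question.

```python
# Sketch of a more_like_this lookup with elasticsearch-py (7.x-style API).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "size": 5,  # a few quality recommendations per item
    "query": {
        "more_like_this": {
            "fields": ["text"],
            # "like" can reference an already-indexed document by id
            "like": [{"_index": "items", "_id": "item-123"}],  # hypothetical id
            "min_term_freq": 1,
            "max_query_terms": 25,
        }
    },
}
resp = es.search(index="items", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```

The appeal for your scale is that more_like_this extracts the top tf-idf-weighted terms from the source document and runs them as a query at lookup time, so you never materialize all 10^7 x 10^7 pairs up front.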

They have also implemented other interesting algorithms, such as the significant terms aggregation.
