
Spark ml cosine similarity: how to get 1 to n similarity score

I read that I could use the columnSimilarities method that comes with RowMatrix to find the cosine similarity of various records (content-based). My data looks something like this:

genre,actor
horror,mohanlal shobhana pranav 
comedy,mammooty suraj dulquer
romance,fahad dileep manju
comedy,prithviraj

Now, I have created a spark-ml pipeline to calculate the tf-idf of the above text features (genre, actor), and it uses a VectorAssembler to assemble both features into a single column "features". After that, I convert the resulting DataFrame using this:

val vectorRdd = finalDF.map(row => row.getAs[Vector]("features"))

to convert it into an RDD[Vector].
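For context, the pipeline described above might be sketched roughly as follows. This is a minimal illustration, not the asker's original code: the column names, the use of Tokenizer/HashingTF/IDF, and the input DataFrame `df` are all assumptions.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer, VectorAssembler}

// Tokenize and TF-IDF each text column separately (names are illustrative).
val genreTok = new Tokenizer().setInputCol("genre").setOutputCol("genreTokens")
val actorTok = new Tokenizer().setInputCol("actor").setOutputCol("actorTokens")
val genreTF  = new HashingTF().setInputCol("genreTokens").setOutputCol("genreTF")
val actorTF  = new HashingTF().setInputCol("actorTokens").setOutputCol("actorTF")
val genreIDF = new IDF().setInputCol("genreTF").setOutputCol("genreFeatures")
val actorIDF = new IDF().setInputCol("actorTF").setOutputCol("actorFeatures")

// Assemble both tf-idf vectors into a single "features" column.
val assembler = new VectorAssembler()
  .setInputCols(Array("genreFeatures", "actorFeatures"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(
  Array(genreTok, actorTok, genreTF, actorTF, genreIDF, actorIDF, assembler))

// df is assumed to be a DataFrame with string columns "genre" and "actor".
val finalDF = pipeline.fit(df).transform(df)
```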

Then, I obtain my RowMatrix by

val matrix = new RowMatrix(vectorRdd)

I am following this guide for reference on cosine similarity. What I need is a method in spark-mllib to find the similarity between one particular record and all the others, like this method in sklearn, as shown in the guide:

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

But I am unable to find out how to do this. I don't understand what matrix.columnSimilarities() is comparing and returning. Can someone help me with what I am looking for?

Any help is appreciated! Thanks.

I calculated it myself with two small functions. Call cosineSimilarity on a crossJoin of two DataFrames (split the first row and the remaining rows into two separate DataFrames).

import org.apache.spark.ml.linalg.SparseVector

// Returns (dot product, cosine similarity). normASqrt and normBSqrt are the
// pre-computed L2 norms of the two vectors (see normSqrt below).
def cosineSimilarity(vectorA: SparseVector,
        vectorB: SparseVector,
        normASqrt: Double,
        normBSqrt: Double): (Double, Double) = {
    var dotProduct = 0.0
    // Only the active (non-zero) indices of vectorA can contribute.
    for (i <- vectorA.indices) {
        dotProduct += vectorA(i) * vectorB(i)
    }
    val div = normASqrt * normBSqrt
    if (div == 0)
        (dotProduct, 0.0)
    else
        (dotProduct, dotProduct / div)
}

// L2 norm of a sparse vector: sqrt of the sum of squared active values.
val normSqrt: (org.apache.spark.ml.linalg.SparseVector => Double) =
    (vector: org.apache.spark.ml.linalg.SparseVector) => {
        var norm = 0.0
        for (i <- vector.indices) {
            norm += Math.pow(vector(i), 2)
        }
        Math.sqrt(norm)
    }
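The two functions above can be wired together roughly like this. This is a sketch under stated assumptions: `finalDF` with a "features" column is carried over from the question, and the renamed columns, the UDF registration, and the limit(1)-based split are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.{col, udf}

// Pre-compute the L2 norm of every record once.
val normUdf = udf(normSqrt)
val withNorm = finalDF.withColumn("norm", normUdf(col("features")))

// Split: the record we compare from (here, the first row) vs. all the others.
val first = withNorm.limit(1)
  .select(col("features").as("featuresA"), col("norm").as("normA"))
val rest = withNorm
  .select(col("features").as("featuresB"), col("norm").as("normB"))

// Apply cosineSimilarity to every (first, other) pair from the cross join;
// keep only the similarity score (second element of the returned tuple).
val cosUdf = udf((a: SparseVector, b: SparseVector, na: Double, nb: Double) =>
  cosineSimilarity(a, b, na, nb)._2)

val scored = first.crossJoin(rest)
  .withColumn("cosine",
    cosUdf(col("featuresA"), col("featuresB"), col("normA"), col("normB")))
```

Note that crossJoin produces one output row per (first, other) pair, which is exactly the "1 to n" shape of sklearn's cosine_similarity(tfidf_matrix[0:1], tfidf_matrix) call from the question.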
