Spark ml cosine similarity: how to get 1 to n similarity score
I read that I could use the columnSimilarities method that comes with RowMatrix to find the cosine similarity of various records (content-based). My data looks something like this:
genre,actor
horror,mohanlal shobhana pranav
comedy,mammooty suraj dulquer
romance,fahad dileep manju
comedy,prithviraj
Now, I have created a spark-ml pipeline to calculate the tf-idf of the above text features (genre, actor), and it uses VectorAssembler in my pipeline to assemble both features into a single column "features". After that, I convert the obtained DataFrame using this:
val vectorRdd = finalDF.map(row => row.getAs[Vector]("features"))
to convert it into an RDD[Vector]. Then, I obtain my RowMatrix by:
val matrix = new RowMatrix(vectorRdd)
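(One detail worth noting, as an assumption about the setup rather than something stated above: RowMatrix lives in org.apache.spark.mllib.linalg.distributed and expects the older mllib Vector type, while a spark-ml pipeline produces org.apache.spark.ml.linalg.Vector. If the two types don't line up, a conversion step along these lines may be needed; `finalDF` and the "features" column name are taken from the question.)

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: convert each spark-ml Vector to an mllib Vector
// (Vectors.fromML preserves sparsity) before building the RowMatrix.
val mllibRdd = finalDF
  .select("features")
  .rdd
  .map(row => MLlibVectors.fromML(row.getAs[MLVector]("features")))

val matrix = new RowMatrix(mllibRdd)
```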
I am following this guide as a reference for cosine similarity, and what I need is a method in spark-mllib to find the similarity between one particular record and all the others, like this sklearn method shown in the guide:
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
But I am unable to find out how to do this. I don't understand what matrix.columnSimilarities() is comparing and returning. Can someone help me with what I am looking for? Any help is appreciated! Thanks.
I calculated it myself with 2 small functions. Call cosineSimilarity on the crossJoin of 2 DataFrames (split the first row and the remaining rows into 2 DataFrames).
import org.apache.spark.ml.linalg.SparseVector

// Returns (dot product, cosine similarity). The L2 norms of the two
// vectors are passed in precomputed (via normSqrt below) so they are
// not recalculated for every pair.
def cosineSimilarity(vectorA: SparseVector,
                     vectorB: SparseVector,
                     normA: Double,
                     normB: Double): (Double, Double) = {
  var dotProduct = 0.0
  // Iterating over vectorA's non-zero indices is enough: vectorB(i)
  // returns 0.0 for indices absent from vectorB.
  for (i <- vectorA.indices) {
    dotProduct += vectorA(i) * vectorB(i)
  }
  val div = normA * normB
  if (div == 0) (dotProduct, 0.0)
  else (dotProduct, dotProduct / div)
}

// Euclidean (L2) norm of a sparse vector.
val normSqrt: SparseVector => Double = (vector: SparseVector) => {
  var norm = 0.0
  for (i <- vector.indices) {
    norm += Math.pow(vector(i), 2)
  }
  Math.sqrt(norm)
}
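For completeness, here is a rough sketch of how the two functions above might be wired together with a crossJoin. The DataFrame names (`first` for the single record of interest, `others` for the rest) and column names are assumptions for illustration, and it assumes the "features" column holds SparseVectors, as tf-idf output typically is:

```scala
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.functions.{col, udf}

// Precompute each side's norm once, and rename the feature columns
// so they stay distinguishable after the cross join.
val normUdf = udf(normSqrt)
val left = first
  .withColumnRenamed("features", "featuresA")
  .withColumn("normA", normUdf(col("featuresA")))
val right = others
  .withColumnRenamed("features", "featuresB")
  .withColumn("normB", normUdf(col("featuresB")))

// Keep only the cosine similarity (the second element of the tuple).
val cosineUdf = udf((a: SparseVector, b: SparseVector, na: Double, nb: Double) =>
  cosineSimilarity(a, b, na, nb)._2)

// Cross join the 1 record against the n others and score each pair.
val scored = left.crossJoin(right)
  .withColumn("cosineSimilarity",
    cosineUdf(col("featuresA"), col("featuresB"), col("normA"), col("normB")))
```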