
Calculate the cosine distance between two columns in Spark

I am using Python and Spark to solve an issue. I have a Spark DataFrame containing two columns, each of which holds a scalar of numeric (e.g. double or float) type.

I want to interpret these two columns as vectors and calculate the cosine similarity between them. So far I have only found Spark linear algebra that works on DenseVectors stored in individual cells of the DataFrame.

Code sample

Code in NumPy

import numpy as np
from numpy.linalg import norm

vec = np.array([1, 2])
vec_2 = np.array([2, 1])
# cosine similarity: dot product divided by the product of the norms
angle_vec_vec = np.dot(vec, vec_2) / (norm(vec) * norm(vec_2))
print(angle_vec_vec)  # 0.8

The result should be 0.8.

How can I do this in Spark?

df_small = spark.createDataFrame([(1, 2), (2, 1)])
df_small.show()

Is there a way to convert a column of double values to a DenseVector? Do you see any other solution to my problem?

You can see here a sample that calculates the cosine distance in Scala. The strategy is to represent the documents as a RowMatrix and then use its columnSimilarities() method.
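
A minimal sketch of that strategy using PySpark's RowMatrix, assuming the df_small DataFrame created in the question (each DataFrame row becomes one row of the distributed matrix, so the two DataFrame columns are the two columns being compared):

from pyspark.mllib.linalg.distributed import RowMatrix

# Turn each Row into a plain list of values; RowMatrix converts
# these into the rows of a distributed matrix.
mat = RowMatrix(df_small.rdd.map(list))

# columnSimilarities() returns a CoordinateMatrix holding the
# upper-triangular pairwise cosine similarities between the columns.
sims = mat.columnSimilarities()
print(sims.entries.collect())  # [MatrixEntry(0, 1, 0.8)]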

If you want to use PySpark, you can try what's suggested here.
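
Alternatively, for just two scalar columns, the cosine similarity can be computed directly with DataFrame aggregations, with no vector conversion at all. A minimal sketch, assuming the default column names _1 and _2 that createDataFrame assigns above:

from pyspark.sql import functions as F

# A single aggregation pass computes the dot product and both norms.
row = df_small.agg(
    F.sum(F.col("_1") * F.col("_2")).alias("dot"),
    F.sqrt(F.sum(F.col("_1") ** 2)).alias("norm_1"),
    F.sqrt(F.sum(F.col("_2") ** 2)).alias("norm_2"),
).first()

print(row["dot"] / (row["norm_1"] * row["norm_2"]))  # 0.8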
