
Pyspark LSH Followed by Cosine Similarity

I have many users and each user has an associated vector. I would like to compute the cosine similarity between every pair of users, but that is computationally prohibitive at this scale. LSH seems like a good approximation step: as I understand it, it creates buckets such that users who are likely to be similar are mapped, with high probability, to the same bucket. In Pyspark, for example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

ss = SparkSession.builder.getOrCreate()

dataA = [(0, Vectors.dense([1.0, 1.0]),),
         (1, Vectors.dense([1.0, -1.0]),),
         (4, Vectors.dense([1.0, -1.0]),),
         (5, Vectors.dense([1.1, -1.0]),),
         (2, Vectors.dense([-1.0, -1.0]),),
         (3, Vectors.dense([-1.0, 1.0]),)]
dfA = ss.createDataFrame(dataA, ["id", "features"])

# Hash each feature vector into buckets using 3 independent hash tables.
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=1.0, numHashTables=3)
model = brp.fit(dfA)
model.transform(dfA).show(truncate=False)


+---+-----------+-----------------------+
|id |features   |hashes                 |
+---+-----------+-----------------------+
|0  |[1.0,1.0]  |[[-1.0], [0.0], [-1.0]]|
|1  |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|4  |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|5  |[1.1,-1.0] |[[-2.0], [-2.0], [1.0]]|
|2  |[-1.0,-1.0]|[[0.0], [-1.0], [0.0]] |
|3  |[-1.0,1.0] |[[1.0], [1.0], [-2.0]] |
+---+-----------+-----------------------+

Any pointers on how best to set bucketLength and numHashTables would be appreciated.

Assuming I have the above with 3 hash tables, how can I determine the buckets within each one to calculate the cosine similarity, given that there is more than one? I assumed the point of using LSH for this task is to group by the value in the "hashes" column and only perform pairwise similarity within each group. Is this correct?

I assumed the use of LSH for this task is to group by the value in the "hashes" column and only perform pairwise similarity within each. Is this correct?

Yes. LSH uses a method to reduce dimensionality while preserving similarity. It hashes your data into buckets, and only items that end up in the same bucket are then compared (i.e., have their distance calculated).
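To answer the multi-table part of the question directly: a row is a candidate match if it collides with another row in at least one of the hash tables, not in all of them. Below is a minimal sketch of that grouping, assuming the model and dfA from above (in practice approxSimilarityJoin, shown further down, does this candidate generation for you):

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array  # requires Spark >= 3.0

# Explode the hash tables so each (table index, bucket value) pair becomes
# a join key, then pair up ids that share a bucket in at least one table.
hashed = model.transform(dfA)
buckets = (hashed
    .select("id", F.posexplode("hashes").alias("table", "hash"))
    .withColumn("bucket", vector_to_array("hash").getItem(0)))

candidates = (buckets.alias("a")
    .join(buckets.alias("b"), ["table", "bucket"])
    .where(F.col("a.id") < F.col("b.id"))
    .select(F.col("a.id").alias("idA"), F.col("b.id").alias("idB"))
    .distinct())
candidates.show()

Cosine similarity is then computed only for these candidate pairs instead of all n² pairs.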

The magic is in tuning the number of buckets and hash functions to reduce the number of false positives and false negatives. There isn't a set number; it depends on your data.

Here r is the number of rows per band (the band size) and b is the number of bands, i.e. the number of buckets/hash tables you will be using to detect matches (roughly what numHashTables controls in Spark).

From this article, which helped me understand what was happening:

Let's say your signature matrix has 100 rows. Consider 2 cases:

b1 = 10 → r = 10

b2 = 20 → r = 5

In the 2nd case, there is a higher chance for 2 [vectors] to appear in the same bucket at least once, as they have more opportunities (20 vs 10) and fewer elements of the signature are getting compared (5 vs 10).
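You can sanity-check that claim with the standard banding formula: with b bands of r rows each, two items whose signatures agree with probability s become candidates in at least one band with probability 1 - (1 - s^r)^b. A quick back-of-the-envelope script (the similarity values here are hypothetical, purely to illustrate the trade-off):

def candidate_probability(s, b, r):
    # Chance of colliding in at least one of b bands of r rows each.
    return 1 - (1 - s ** r) ** b

for s in (0.4, 0.6, 0.8):
    print(f"s={s}: b=10,r=10 -> {candidate_probability(s, 10, 10):.3f}, "
          f"b=20,r=5 -> {candidate_probability(s, 20, 5):.3f}")

At s = 0.6, the (b=20, r=5) setting surfaces the pair with probability ~0.80 versus ~0.06 for (b=10, r=10): more buckets and shorter bands mean more chances to collide.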

If you need to join, you can use approxSimilarityJoin and set the acceptable distance. (This is another parameter you need to tune. Distance here is the distance between vectors that have fallen into at least one of the hash buckets, making them likely to be close to each other.)

distance = 300

# df and df2 are the two datasets to match (use the same one for a self-join).
model.approxSimilarityJoin(df, df2, distance, distCol="EuclideanDistance").select(
    col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()

You can get an idea of what's reasonable for the distance between vectors by reviewing the data (from the join), or by playing around with approxNearestNeighbors. If you want the 10 nearest neighbors, here's how you can find their distance:

NumberOfNeighbors = 10
CandidateVector = Vectors.dense([1.0, 2.0])
model.approxNearestNeighbors(df2, CandidateVector, NumberOfNeighbors).collect()
[Row(id=4, features=DenseVector([2.0, 2.0]), hashes=[DenseVector([1.0])], distCol=1.0)]
