
How to get cosine similarity scores for all users and all items in PySpark, if the user and item embeddings are given?

I have a users DataFrame:

df1 = spark.createDataFrame([
    ("u1", [0., 2., 3.]),
    ("u2", [1., 0., 0.]),
    ("u3", [0., 0., 3.]),
    ],
    ['user_id', 'features'])

df1.printSchema()
df1.show(truncate=False)

Output:

root
 |-- user_id: string (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)

+-------+---------------+
|user_id|features       |
+-------+---------------+
|u1     |[0.0, 2.0, 3.0]|
|u2     |[1.0, 0.0, 0.0]|
|u3     |[0.0, 0.0, 3.0]|
+-------+---------------+

And I have an items DataFrame:

df2 = spark.createDataFrame([
    ("i1", [0., 2., 3.]),
    ("i2", [1.1, 0., 0.]),
    ("i3", [0., 0., 3.1]),
    ],
    ['item_id', 'features'])

df2.printSchema()
df2.show(truncate=False)

Output:

root
 |-- item_id: string (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)

+-------+---------------+
|item_id|features       |
+-------+---------------+
|i1     |[0.0, 2.0, 3.0]|
|i2     |[1.1, 0.0, 0.0]|
|i3     |[0.0, 0.0, 3.1]|
+-------+---------------+

How do I calculate the cosine similarity score for all the user-item pairs, such that it becomes easy for me to rank the items for every user?
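
For reference, the score in question is the standard cosine similarity between a user embedding $u$ and an item embedding $v$:

$$\operatorname{cosine\_similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$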

The final DataFrame should look something like this:

+-------+-------+-----------------+
|user_id|item_id|cosine_similarity|
+-------+-------+-----------------+
|u1     |     i1|      some number|
|u1     |     i2|      some number|
|u1     |     i3|      some number|
|u2     |     i1|      some number|
|u2     |     i2|      some number|
|u2     |     i3|      some number|
|u3     |     i1|      some number|
|u3     |     i2|      some number|
|u3     |     i3|      some number|
+-------+-------+-----------------+

Here is a way using sklearn and the underlying RDD:

from pyspark.sql import functions as F
from sklearn.metrics.pairwise import cosine_similarity

# Join DFs
df = df1.crossJoin(df2.select('item_id', F.col("features").alias("features_item")))

# Get cosine similarity
result = df.rdd.map(lambda x: (x['user_id'], x['item_id'],
                               float(
                                   cosine_similarity(
                                       [x['features']],
                                       [x['features_item']]
                                   )[0,0]
                               )
                              )
                   ).toDF(schema=['user_id', 'item_id', 'cosine_similarity'])
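
Mapping over the RDD calls sklearn once per user-item pair in Python. On Spark 3.1+, the same score can also be computed with the built-in higher-order array functions zip_with and aggregate, keeping everything in column expressions; a minimal sketch (the dot helper and the result_native name are illustrative, not part of the original answer):

from pyspark.sql import functions as F

def dot(a, b):
    # Element-wise product of the two array columns, folded into a sum
    return F.aggregate(F.zip_with(a, b, lambda x, y: x * y),
                       F.lit(0.0), lambda acc, x: acc + x)

pairs = df1.crossJoin(df2.select('item_id', F.col('features').alias('features_item')))
result_native = pairs.select(
    'user_id',
    'item_id',
    (dot(F.col('features'), F.col('features_item'))
     / F.sqrt(dot(F.col('features'), F.col('features')))
     / F.sqrt(dot(F.col('features_item'), F.col('features_item')))
    ).alias('cosine_similarity')
)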

A manual implementation of cosine similarity:

import pyspark.sql.functions as F

# Infer the embedding dimension from the first row
size = df1.limit(1).select(F.size('features')).first()[0]
# Cross join users with items; rename the item features column to avoid a name clash
joined = df1.crossJoin(df2.withColumnRenamed('features', 'features2'))
# cosine_similarity = dot_product / (norm1 * norm2)
result = joined.select(
    'user_id',
    'item_id',
    sum([F.col('features')[i] * F.col('features2')[i] for i in range(size)]).alias('dot_product'),
    F.sqrt(sum([F.col('features')[i] * F.col('features')[i] for i in range(size)])).alias('norm1'),
    F.sqrt(sum([F.col('features2')[i] * F.col('features2')[i] for i in range(size)])).alias('norm2')
).selectExpr(
    'user_id',
    'item_id',
    'dot_product / norm1 / norm2 as cosine_similarity'
)

result.show()
+-------+-------+------------------+
|user_id|item_id| cosine_similarity|
+-------+-------+------------------+
|     u1|     i1|1.0000000000000002|
|     u1|     i2|               0.0|
|     u1|     i3|0.8320502943378437|
|     u2|     i1|               0.0|
|     u2|     i2|               1.0|
|     u2|     i3|               0.0|
|     u3|     i1|0.8320502943378437|
|     u3|     i2|               0.0|
|     u3|     i3|               1.0|
+-------+-------+------------------+
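
Since the end goal is ranking the items for every user, a window function over either result gives the per-user ordering directly; a minimal sketch using the result DataFrame from above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Best-matching item first within each user
w = Window.partitionBy('user_id').orderBy(F.desc('cosine_similarity'))
result.withColumn('rank', F.row_number().over(w)).show()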


Related questions:
- Calculating the cosine similarity between all the rows of a dataframe in pyspark
- Cosine similarity pyspark
- How to get the ranked cosine similarity?
- How to run spaCy's sentence similarity function to an array of strings to get an array of scores?
- Calculate cosine similarity for all columns in a group by in a dataframe
- Word Mover's Distance vs Cosine Similarity
- Cosine similarity between the same dictionary's values
- PYSPARK: How to find cosine similarity of two columns in a pyspark dataframe?
- What's the fastest way in Python to calculate cosine similarity given sparse matrix data?
- Am I clustering users correctly by using sklearn's cosine similarity method and K-means algorithm?