
What's the best way to do matrix multiplication between 2 lists of SparseVectors with the DataFrame-based API in pyspark?

I have 2 DataFrames with the same structure: DataFrame[id: bigint, tfidf_features: vector]

I need to multiply the rows in dataframe1 with the rows in dataframe2. I can use a loop and do things like: dataframe1.collect()[i]['tfidf_features'].dot(dataframe2.collect()[j]['tfidf_features'])

However, I would like to use matrix multiplication, something equivalent to: np.matmul(dataframe1_tfidf_features, dataframe2_tfidf_features.T)
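To make the target concrete, here is a minimal pure-Python sketch of what that expression computes: entry (i, j) of the result is the dot product of row i of the first DataFrame with row j of the second. The {index: value} dict representation and the names sparse_dot and pairwise_dots are assumptions made for illustration only; they are not part of pyspark.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the shorter vector and look indices up in the longer one.
    if len(u) > len(v):
        u, v = v, u
    return sum((val * v[idx] for idx, val in u.items() if idx in v), 0.0)


def pairwise_dots(rows_a, rows_b):
    """Return the matrix A x B^T as nested lists, one dot product per cell."""
    return [[sparse_dot(a, b) for b in rows_b] for a in rows_a]


# Two tiny "DataFrames": 2 rows and 3 rows over a 3-dimensional feature space.
rows_a = [{0: 1.0, 2: 2.0}, {1: 3.0}]
rows_b = [{0: 4.0, 1: 5.0}, {2: 6.0}, {1: 1.0, 2: 1.0}]

result = pairwise_dots(rows_a, rows_b)
```

The result has one row per row of the first input and one column per row of the second, which is the shape any distributed solution should reproduce.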

You have two choices:
1. mllib.linalg.distributed.BlockMatrix: convert both DataFrames to block matrices and use multiply.

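from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Assumes the vector column holds pyspark.ml vectors; IndexedRow needs mllib vectors, hence Vectors.fromML
bm1 = IndexedRowMatrix(df1.rdd.map(lambda x: IndexedRow(x[0], Vectors.fromML(x[1])))).toBlockMatrix()
bm2 = IndexedRowMatrix(df2.rdd.map(lambda x: IndexedRow(x[0], Vectors.fromML(x[1])))).toBlockMatrix()
bm_result = bm1.multiply(bm2.transpose())  # rows of df1 times rows of df2, i.e. A x B^T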

2. pyspark.sql.dataframe.crossJoin: cross join both DataFrames, calculate each individual element of the resulting matrix, and then use collect_list and sort.

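from pyspark.sql.functions import col, collect_list, struct, udf
from pyspark.sql.types import ArrayType, DoubleType

# Pair every row of df1 with every row of df2
# (the vector column is named "features" here; use your own column name)
df = df1.crossJoin(df2.select(col("id").alias("id2"),
                              col("features").alias("features2")))

# The dot product of the two vectors is one element of the result matrix
udf_mult = udf(lambda x, y: float(x.dot(y)), DoubleType())
df = (df.withColumn("val", udf_mult("features", "features2"))
        .drop("features", "features2"))

# Gather (id2, val) pairs per id; each group becomes one row of the matrix
st = struct(["id2", "val"]).alias("map")
df = df.select("id", st).groupBy("id").agg(collect_list("map").alias("list"))

def sort(x):
    # Order the pairs by id2 and keep only the values
    x = sorted(x, key=lambda p: p[0])
    return [p[1] for p in x]

udf_sort = udf(sort, ArrayType(DoubleType()))
df = df.withColumn("list", udf_sort("list"))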
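
The collect_list and sort step can be pictured locally as follows. This is a plain-Python sketch, assuming the cross-joined values arrive as (id, id2, val) tuples in arbitrary order, which is what you should expect after a shuffle since collect_list does not guarantee ordering. The helper name assemble_rows is hypothetical and is not a Spark function.

```python
from collections import defaultdict


def assemble_rows(triples):
    """Group (id, id2, val) triples by id and order each group's values by id2."""
    grouped = defaultdict(list)
    for row_id, col_id, val in triples:
        grouped[row_id].append((col_id, val))
    # Sorting by id2 fixes the column order, so every row lines up the same way.
    return {
        row_id: [val for _, val in sorted(pairs, key=lambda p: p[0])]
        for row_id, pairs in grouped.items()
    }


# Triples deliberately out of order to mimic unordered collection.
triples = [(1, 20, 0.5), (0, 10, 1.0), (1, 10, 2.0), (0, 20, 3.0)]
rows = assemble_rows(triples)
```

Each value list is one row of the product matrix, with columns ordered by the ids of the second DataFrame.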

