
pyspark calculate distance matrix of sparse vectors

I'm trying to build a generic way to calculate a distance matrix of many sparse vectors (100k vectors with a length of 250k). In my example the data is represented as a scipy csr matrix. This is what I'm doing:

First I'm defining a method to transform the csr rows to pyspark SparseVectors:

from pyspark.mllib.linalg import SparseVector  # pyspark.ml.linalg provides an equivalent SparseVector

def csr_to_sparse_vector(row):
    # assumes the csr row is in canonical form, i.e. row.indices is already sorted and aligned with row.data
    return SparseVector(row.shape[1], sorted(row.indices), row.data)
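
A quick way to check the conversion (a made-up toy matrix, purely for illustration):

import numpy as np
from scipy.sparse import csr_matrix

demo = csr_matrix(np.array([[0, 1, 0, 2], [3, 0, 0, 0]]))  # toy data just for illustration
demo_vectors = [csr_to_sparse_vector(row) for row in demo]
# demo_vectors[0] is SparseVector(4, {1: 1.0, 3: 2.0})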

Now I transform the rows into vectors and save them to a list which I then feed to the SparkContext:

sparse_vectors = [csr_to_sparse_vector(row) for row in refs_sample]
rdd = sc.parallelize(sparse_vectors)

In the next step I use the cartesian function to build all the pairs (similar to this post: Pyspark calculate custom distance between all vectors in a RDD).
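
The pair-building step itself is just the following (shown here for completeness; rdd2 is the pair RDD that the map further down operates on):

rdd2 = rdd.cartesian(rdd)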

In this experiment I want to use the Jaccard similarity, which I define as follows:

def jacc_sim(pair):
    dot_product = pair[0].dot(pair[1])
    try:
        sim = dot_product / (pair[0].numNonzeros() + pair[1].numNonzeros())
    except ZeroDivisionError:
        return 0.0
    return sim
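
For two small vectors this gives, for example (a made-up pair just to illustrate the formula):

from pyspark.mllib.linalg import SparseVector

a = SparseVector(4, [0, 1], [1.0, 1.0])
b = SparseVector(4, [1, 2], [1.0, 1.0])
jacc_sim((a, b))  # 1.0 / (2 + 2) = 0.25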

Now I should just map the function and collect the results:

distance_matrix = rdd2.map(lambda x: jacc_sim(x)).collect()

I'm running this code on a small sample with only 100 documents, both on a local machine and on a cluster with 180 nodes. The task takes forever and finally crashes: https://pastebin.com/UwLUXvUZ

Any suggestions as to what might be wrong?

Additionally, if the distance measure is symmetric, i.e. sim(x,y) == sim(y,x), we only need the upper triangle of the matrix. I found a post that solves this problem by filtering (Upper triangle of cartesian in spark for symmetric operations: `x*(x+1)//2` instead of `x**2`):

rdd2 = rdd.cartesian(rdd).filter(lambda x: x[0] < x[1])

But this doesn't work for the list of SparseVectors.

The problem was a configuration error that led to my data being split into 1000 partitions. The solution was simply to tell Spark explicitly how many partitions it should create (e.g. 10):

rdd = sc.parallelize(sparse_vectors, 10)
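
You can verify the resulting partitioning with (a quick check, not part of the original snippet):

rdd.getNumPartitions()  # 10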

Moreover, I extended the list of sparse vectors with an enumeration; this way I could then filter out pairs which are not part of the upper triangle of the matrix:

sparse_vectors = [(i, csr_to_sparse_vector(row)) for i, row in enumerate(authors)]
rdd = sc.parallelize(sparse_vectors, 10)
rdd2 = rdd.cartesian(rdd).filter(lambda x: x[0][0] < x[1][0])
rdd2.map(lambda x: jacc_sim(x)).filter(lambda x: x is not None).saveAsTextFile('hdfs:///user/username/similarities')
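
As a sanity check (my own addition, not part of the original pipeline), the strict < filter should leave exactly n*(n-1)//2 pairs:

n = rdd.count()
assert rdd2.count() == n * (n - 1) // 2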

The corresponding similarity function looks like this:

def jacc_sim(pair):
    id_0 = pair[0][0]
    vec_0 = pair[0][1]
    id_1 = pair[1][0]
    vec_1 = pair[1][1]
    dot_product = vec_0.dot(vec_1)
    try:
        sim = dot_product / (vec_0.numNonzeros() + vec_1.numNonzeros())
        if sim > 0:
            return (id_0, id_1, sim)
    except ZeroDivisionError:
        pass
    return None

This worked very well for me and I hope someone else will find it useful as well!

Is it the list that's problematic, or the fact that the list consists of SparseVectors? One thought is to try converting the SparseVectors to DenseVectors, a suggestion I found here (Convert Sparse Vector to Dense Vector in Pyspark). The calculation result is no different; it only changes how Spark handles it.
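
A minimal sketch of that conversion (my own illustration, assuming the plain SparseVector RDD from the question):

from pyspark.mllib.linalg import DenseVector

dense_rdd = rdd.map(lambda sv: DenseVector(sv.toArray()))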
