
pyspark calculate distance matrix of sparse vectors

I'm trying to build a generic way to calculate a distance matrix of many sparse vectors (100k vectors with a length of 250k). In my example the data is represented as a scipy csr matrix. This is what I'm doing:

First I'm defining a method to transform the csr rows to pyspark SparseVectors:

from pyspark.mllib.linalg import SparseVector  # pyspark.ml.linalg provides an equivalent SparseVector

def csr_to_sparse_vector(row):
    # assumes the csr row is in canonical form, i.e. row.indices is already sorted and aligned with row.data
    return SparseVector(row.shape[1], sorted(row.indices), row.data)
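
A quick way to check the conversion (a made-up toy matrix, purely for illustration):

import numpy as np
from scipy.sparse import csr_matrix

demo = csr_matrix(np.array([[0, 1, 0, 2], [3, 0, 0, 0]]))  # toy data just for illustration
demo_vectors = [csr_to_sparse_vector(row) for row in demo]
# demo_vectors[0] is SparseVector(4, {1: 1.0, 3: 2.0})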

Now I transform the rows into vectors and save them to a list which I then feed to the SparkContext:

sparse_vectors = [csr_to_sparse_vector(row) for row in refs_sample]
rdd = sc.parallelize(sparse_vectors)

In the next step I use the cartesian function to build all the pairs (similar to this post: Pyspark calculate custom distance between all vectors in a RDD).
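
The pair-building step itself is just the following (shown here for completeness; rdd2 is the pair RDD that the map further down operates on):

rdd2 = rdd.cartesian(rdd)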

In this experiment I want to use the Jaccard similarity, which I define as follows:

def jacc_sim(pair):
    dot_product = pair[0].dot(pair[1])
    try:
        sim = dot_product / (pair[0].numNonzeros() + pair[1].numNonzeros())
    except ZeroDivisionError:
        return 0.0
    return sim
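
For two small vectors this gives, for example (a made-up pair just to illustrate the formula):

from pyspark.mllib.linalg import SparseVector

a = SparseVector(4, [0, 1], [1.0, 1.0])
b = SparseVector(4, [1, 2], [1.0, 1.0])
jacc_sim((a, b))  # 1.0 / (2 + 2) = 0.25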

Now I should just map the function and collect the results:

distance_matrix = rdd2.map(lambda x: jacc_sim(x)).collect()

I'm running this code on a small sample with only 100 documents, both on a local machine and on a cluster with 180 nodes. The task takes forever and finally crashes: https://pastebin.com/UwLUXvUZ

Any suggestions as to what might be wrong?

Additionally, if the distance measure is symmetric, i.e. sim(x,y) == sim(y,x), we only need the upper triangle of the matrix. I found a post that solves this problem by filtering (Upper triangle of cartesian in spark for symmetric operations: `x*(x+1)//2` instead of `x**2`):

rdd2 = rdd.cartesian(rdd).filter(lambda x: x[0] < x[1])

But this doesn't work for the list of SparseVectors.

The problem was a configuration error that led to my data being split into 1000 partitions. The solution was simply to tell Spark explicitly how many partitions it should create (e.g. 10):

rdd = sc.parallelize(sparse_vectors, 10)
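
You can verify the resulting partitioning with (a quick check, not part of the original snippet):

rdd.getNumPartitions()  # 10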

Moreover, I extended the list of sparse vectors with an enumeration; this way I could then filter out pairs which are not part of the upper triangle of the matrix:

sparse_vectors = [(i, csr_to_sparse_vector(row)) for i, row in enumerate(authors)]
rdd = sc.parallelize(sparse_vectors, 10)
rdd2 = rdd.cartesian(rdd).filter(lambda x: x[0][0] < x[1][0])
rdd2.map(lambda x: jacc_sim(x)).filter(lambda x: x is not None).saveAsTextFile('hdfs:///user/username/similarities')
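
As a sanity check (my own addition, not part of the original pipeline), the strict < filter should leave exactly n*(n-1)//2 pairs:

n = rdd.count()
assert rdd2.count() == n * (n - 1) // 2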

The corresponding similarity function looks like this:

def jacc_sim(pair):
    id_0 = pair[0][0]
    vec_0 = pair[0][1]
    id_1 = pair[1][0]
    vec_1 = pair[1][1]
    dot_product = vec_0.dot(vec_1)
    try:
        sim = dot_product / (vec_0.numNonzeros() + vec_1.numNonzeros())
        if sim > 0:
            return (id_0, id_1, sim)
    except ZeroDivisionError:
        pass
    return None

This worked very well for me and I hope someone else will find it useful as well!

Is it the list that's problematic, or the fact that the list consists of SparseVectors? One thought is to try converting the SparseVectors to DenseVectors, a suggestion I found here (Convert Sparse Vector to Dense Vector in Pyspark). The calculation result is no different; it only changes how Spark handles it.
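
A minimal sketch of that conversion (my own illustration, assuming the plain SparseVector RDD from the question):

from pyspark.mllib.linalg import DenseVector

dense_rdd = rdd.map(lambda sv: DenseVector(sv.toArray()))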
