
How to join three RDDs into a tuple?

I am relatively new to Apache Spark in Python, and here is what I am trying to do. I have input data as follows:

  • rdd_row is an RDD of row indices (i),
  • rdd_col is an RDD of column indices (j),
  • rdd_values is an RDD of values (v).

The above three RDDs are huge.

I am trying to convert them into a sparse RDD matrix:

rdd_mat = ([rdd_row], [rdd_col], [rdd_values])

i.e.,

rdd_mat = ([i1, i2, i3, ...], [j1, j2, j3, ...], [v1, v2, v3, ...])

I have tried:

zip, as in rdd_row.zip(rdd_col).zip(rdd_values),

but it ends up giving

[((i1, j1), v1), ((i2, j2), v2), ...]

and

rdd1.union(rdd2) 

won't create a tuple either; it just concatenates the two RDDs (see the sketch below).
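For reference, here is a minimal sketch (assuming an active SparkContext sc and tiny sample data) of what those two attempts actually produce:

rdd_row = sc.parallelize([0, 1, 2])
rdd_col = sc.parallelize([1, 2, 3])
rdd_values = sc.parallelize([0.1, 0.2, 0.3])

# zip pairs elements positionally, so chaining it nests the pairs:
rdd_row.zip(rdd_col).zip(rdd_values).collect()
# [((0, 1), 0.1), ((1, 2), 0.2), ((2, 3), 0.3)]

# union simply concatenates one RDD after the other:
rdd_row.union(rdd_col).collect()
# [0, 1, 2, 1, 2, 3]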

Any help pointing me in the right direction is much appreciated!

Unfortunately, at this point (Spark 1.4), Scala and Java are a much better choice than Python if you're interested in linear algebra. Assuming you have input as below:

import numpy as np
np.random.seed(323) 

rdd_row = sc.parallelize([0, 1, 1, 2, 3])
rdd_col = sc.parallelize([1, 2, 3, 4, 4])
rdd_vals = sc.parallelize(np.random.uniform(0, 1, size=5))

To get an rdd_mat of the desired shape, you can do something like this:

assert rdd_row.count() == rdd_col.count() == rdd_vals.count()
rdd_mat = sc.parallelize(
    (rdd_row.collect(), rdd_col.collect(), rdd_vals.collect()))

but it is a rather bad idea. As already mentioned by @DeanLa, parallel processing here is extremely limited, not to mention that every part (a whole list of rows, for example) will end up on a single partition / node.

Without knowing how you want to use the output, it is hard to give meaningful advice, but one approach is to use something like the below:

from pyspark.mllib.linalg import Vectors

# Zip the three RDDs positionally and flatten the nested pairs
# into (row, col, val) triples.
coords = (rdd_row
    .zip(rdd_col)
    .zip(rdd_vals)
    .map(lambda rc_v: (rc_v[0][0], rc_v[0][1], rc_v[1]))
    .cache())

# The vector size has to exceed the largest column index.
ncol = coords.map(lambda x: x[1]).max() + 1

# Build one sparse vector of (column, value) pairs per row index.
rows = (coords
    .groupBy(lambda x: x[0])
    .mapValues(lambda values: Vectors.sparse(
        ncol, sorted((col, val) for (row, col, val) in values))))

It will create an RDD of pairs representing the row index and a sparse vector of values for the given row. If you add some joins or group by column, you can implement some typical linear algebra routines yourself; nevertheless, for full-featured distributed data structures it is better to use the Scala / Java CoordinateMatrix or another class from org.apache.spark.mllib.linalg.distributed.
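As a side note, later PySpark releases (Spark 1.6 onwards, if memory serves) expose those distributed classes in Python as well, so on a newer version the coords RDD above can feed a distributed matrix directly; a minimal sketch:

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Each (row, col, val) triple becomes one MatrixEntry.
entries = coords.map(lambda x: MatrixEntry(x[0], x[1], x[2]))
mat = CoordinateMatrix(entries)

mat.numRows(), mat.numCols()  # dimensions inferred from the largest indices

CoordinateMatrix keeps the entries distributed across the cluster, which avoids the single-partition bottleneck of the collect-based approach above.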
