
Spark Scala: How to work with each 3 elements of rdd?

Hello everyone.

I have the following problem:

I have a very big RDD with billions of elements, like:

Array[((Int, Int), Double)] = Array(((0,0),729.0), ((0,1),169.0), ((0,2),1.0), ((0,3),5.0), ...... ((34,45),34.0), .....)

I need to perform the following operation:

take the value of each element at key (i, j) and add to it

min(rdd_value[(i-1, j)],rdd_value[(i, j-1)], rdd_value[(i-1, j-1)])

How can I do this without using collect()? After collect() I get a Java memory error because my RDD is very big.

Thank you very much!

I am trying to port this algorithm from Python, where the time series are RDDs:

from math import sqrt

def DTWDistance(s1, s2):
    # Cumulative-cost table keyed by (i, j); the -1 row/column is the
    # "infinity" boundary so the recurrence also works at the edges.
    DTW = {}

    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0

    # Each cell adds its local distance to the minimum of its three
    # already-computed neighbours: up, left and diagonal.
    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])

    return sqrt(DTW[len(s1) - 1, len(s2) - 1])
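
For example, on two short made-up series it returns the usual DTW distance:

s1 = [1.0, 2.0, 3.0]
s2 = [2.0, 2.0, 4.0]
print(DTWDistance(s1, s2))  # -> 1.4142135... (sqrt of the final cumulative cost 2)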

Now I need to perform the last operation, the one in the for loop. The dist values are already calculated.

Example:

Input (as a matrix):

4 5 1
7 2 3
9 0 1

The RDD looks like:

rdd.take(10)

Array(((1,1), 4), ((1,2), 5), ((1,3), 1), ((2,1), 7), ((2,2), 2), ((2,3), 3), ((3,1), 9), ((3,2), 0), ((3,3), 1))

I want to do this operation:

rdd_value[(i, j)] = rdd_value[(i, j)] + min(rdd_value[(i-1, j)],rdd_value[(i, j-1)], rdd_value[(i-1, j-1)])

For example:

((1, 1), 4) = 4 + min(infinity, infinity, 0) = 4 + 0 = 4


4 5 1
7 2 3
9 0 1

Then

((1, 2), 5) = 5 + min(infinity, 4, infinity) = 5 + 4 = 9


4 9 1
7 2 3
9 0 1

Then

....

Then

((2, 2), 2) = 2 + min(7, 9, 4) = 2 + 4 = 6


4 9 1
7 6 3
9 0 1

Then .....

((3, 3), 1) = 1 + min(3, 0, 2) = 1 + 0 = 1
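
For reference, here is the same update written as a plain sequential Python loop over the 3x3 example (assuming, as in DTWDistance above, that missing neighbours count as infinity and that each cell reuses the already-updated values of its neighbours):

values = {(1, 1): 4, (1, 2): 5, (1, 3): 1,
          (2, 1): 7, (2, 2): 2, (2, 3): 3,
          (3, 1): 9, (3, 2): 0, (3, 3): 1}

inf = float('inf')
for i in range(1, 4):
    for j in range(1, 4):
        if (i, j) == (1, 1):
            continue  # min(inf, inf, 0) = 0, so the corner keeps its value
        # Each cell needs the already-updated values of its three
        # neighbours, which is why this loop is inherently sequential.
        values[(i, j)] += min(values.get((i - 1, j), inf),
                              values.get((i, j - 1), inf),
                              values.get((i - 1, j - 1), inf))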

The short answer is that the problem you are trying to solve cannot be expressed efficiently and concisely in Spark. It doesn't really matter whether you choose plain RDDs or distributed matrices.

To understand why, you have to think about the Spark programming model. A fundamental Spark concept is the graph of dependencies, where each RDD depends on one or more parent RDDs. If your problem were defined as follows:

  • given an initial matrix M_0
  • for i <- 1..n
    • find the matrix M_i where M_i(m, n) = M_{i-1}(m, n) + min(M_{i-1}(m-1, n), M_{i-1}(m-1, n-1), M_{i-1}(m, n-1))

then it would be trivial to express using the Spark API (pseudocode):

rdd
    .flatMap(lambda ((i, j), v): 
        [((i + 1, j), v), ((i, j + 1), v), ((i + 1, j + 1), v)])
    .reduceByKey(min)
    .union(rdd)
    .reduceByKey(add)
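
For reference, here is a runnable PySpark sketch of one step of this pseudocode on the 3x3 example from the question (Python 3 lambdas cannot unpack tuples, so a named helper is used; the SparkContext setup and the names rdd, step and neighbours are illustrative):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="min-neighbour-step")

# The 3x3 toy matrix from the question, keyed by (row, column).
rdd = sc.parallelize([
    ((1, 1), 4), ((1, 2), 5), ((1, 3), 1),
    ((2, 1), 7), ((2, 2), 2), ((2, 3), 3),
    ((3, 1), 9), ((3, 2), 0), ((3, 3), 1),
])

def neighbours(kv):
    (i, j), v = kv
    # Send each value to the three cells that depend on it.
    return [((i + 1, j), v), ((i, j + 1), v), ((i + 1, j + 1), v)]

step = (rdd
        .flatMap(neighbours)
        .reduceByKey(min)   # minimum over the three upstream neighbours
        .union(rdd)         # bring the original values back in
        .reduceByKey(add))  # M_i(m, n) = M_{i-1}(m, n) + min(...)

# Keys that exist only as neighbour targets, e.g. (4, 4), also appear in
# the result; filter on the original index range if they are unwanted.
print(sorted(step.collect()))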

Unfortunately, you are trying to express dependencies between individual values in the same data structure. Spark aside, this is a problem which is much harder to parallelize, not to mention distribute.

This type of dynamic programming is hard to parallelize because at different points it is completely or almost completely sequential. When you try to compute, for example, M_i(0, 0) or M_i(m, n), there is nothing to parallelize. It is hard to distribute because it can generate complex dependencies between blocks.

There are non-trivial ways to handle this in Spark, by computing individual blocks and expressing the dependencies between these blocks, or by using iterative algorithms and propagating messages over an explicit graph (GraphX), but this is far from easy to do right.

At the end of the day there are tools which can be a much better choice for this type of computation than Spark.
