
Perform a nested for loop with RDD.map() in Scala

I'm rather new to Spark and Scala and have a Java background. I have done some programming in Haskell, so I'm not completely new to functional programming.

I'm trying to accomplish some form of nested for-loop. I have an RDD which I want to manipulate based on every two elements in the RDD. The pseudocode (Java-like) would look like this:

// some RDD named rdd is available before this
// (pseudocode: a real RDD has no get() or length)
List<Object> list = new ArrayList<>();
for (int i = 0; i < rdd.length; i++) {
   list.add(rdd.get(i)._1);
   for (int j = 0; j < rdd.length; j++) {
      // "==" is meant as value equality on the keys
      if (rdd.get(i)._1 == rdd.get(j)._1) {
         list.add(rdd.get(j)._1);
      }
   }
}
// Then now let ._1 of the rdd be this list

My Scala solution (which does not work) looks like this:

  val aggregatedTransactions = joinedTransactions.map( f => {
     var list = List[Any](f._2._1)
     // invalid: an RDD (joinedTransactions) cannot be referenced
     // from inside a transformation on another RDD
     val filtered = joinedTransactions.filter(t => f._1 == t._1)

     for(i <- filtered){
       list ::= i._2._1
     }

     (f._1, list, f._2._2)
  })

I'm trying to put item ._2._1 into a list if ._1 of both items is equal. I am aware that I cannot use a filter or a map function within another map function. I've read that you could achieve something like this with a join, but I don't see how I could actually get these items into a list or any structure that can be used as a list.

How do you achieve an effect like this with RDDs?

Assuming your input has the form RDD[(A, (A, B))] for some types A and B, and that the expected result should have the form RDD[A] - not a List (because we want to keep the data distributed) - this would seem to do what you need:

rdd.join(rdd.values).keys

Details:

It's hard to understand the exact input and expected output, as the data structure (type) of neither is explicitly stated, and the requirement is not well explained by the code example. So I'll make some assumptions and hope they help with your specific case.

For the full example, I'll assume:

  • Input RDD has the type RDD[(Int, (Int, Int))]
  • Expected output has the form RDD[Int] and would contain a lot of duplicates - if the original RDD has the "key" X multiple times, each match (in ._2._1) would appear once per occurrence of X as a key

If that's the case we're trying to solve, this join would solve it:

// Some sample data, assuming all ints
val rdd = sc.parallelize(Seq(
  (1, (1, 5)),
  (1, (2, 5)),
  (2, (1, 5)),
  (3, (4, 5))
))

// join the original RDD with rdd.values: as a pair RDD, rdd.values has each
// record's ._2._1 as its key, so the join matches keys (._1) against ._2._1
// keeping only .keys is enough - each matched ._2._1 equals its key anyway
val result: RDD[Int] = rdd.join(rdd.values).keys

// result is an RDD[Int]: one copy of a key per match between ._1 and ._2._1
println(result.collect.toList) // List(1, 1, 1, 1, 2)
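
If what you actually want is the shape your original attempt aimed for - each record paired with the list of all ._2._1 values that share its key - here is a rough sketch under the same assumptions, using the same sample rdd (firsts and aggregated are just illustrative names):

// Sketch, not a definitive implementation: gather all ._2._1 values per key,
// then attach that list to every record with a join
val firsts = rdd.mapValues(_._1).groupByKey().mapValues(_.toList)

val aggregated = rdd.join(firsts).map {
  case (key, ((_, second), list)) => (key, list, second)
}

println(aggregated.collect.toList)
// e.g. List((1, List(1, 2), 5), (1, List(1, 2), 5), (2, List(1), 5), (3, List(4), 5))

Keep in mind that groupByKey materializes all values for a key on a single executor, so for very large groups a combiner-based approach (e.g. aggregateByKey) would scale better.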
