
Perform a nested for loop with RDD.map() in Scala

I'm rather new to Spark and Scala and have a Java background. I have done some programming in Haskell, so I'm not completely new to functional programming.

I'm trying to accomplish some form of nested for-loop. I have an RDD which I want to manipulate based on every pair of elements in the RDD. The pseudocode (Java-like) would look like this:

// some RDD named rdd is available before this
List<Object> list = new ArrayList<>();
for (int i = 0; i < rdd.length; i++) {
    list.add(rdd.get(i)._1);
    // compare element i against every element j with the same ._1
    for (int j = 0; j < rdd.length; j++) {
        if (rdd.get(i)._1 == rdd.get(j)._1) {
            list.add(rdd.get(j)._1);
        }
    }
}
// Then let ._1 of the RDD be this list

My Scala attempt (which does not work) looks like this:

val aggregatedTransactions = joinedTransactions.map( f => {
  var list = List[Any](f._2._1)
  // Illegal: joinedTransactions is an RDD, and an RDD cannot be used
  // inside a transformation of another RDD - this fails at runtime
  val filtered = joinedTransactions.filter(t => f._1 == t._1)

  for (i <- filtered) {
    list ::= i._2._1
  }

  (f._1, list, f._2._2)
})

I'm trying to put item ._2._1 into a list whenever ._1 of both items is equal. I am aware that I cannot use a filter or a map function within another map function, since Spark does not allow nested RDD operations. I've read that you could achieve something like this with a join, but I don't see how I could actually get these items into a list or any structure that can be used as a list.

How do you achieve an effect like this with RDDs?

Assuming your input has the form RDD[(A, (A, B))] for some types A and B, and that the expected result should have the form RDD[A] - not a List (because we want to keep the data distributed) - this would seem to do what you need:

rdd.join(rdd.values).keys
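
To see why this works, here is the chain of types for this one-liner (a sketch of the intermediate shapes, given an input rdd: RDD[(A, (A, B))]):

// rdd.values           : RDD[(A, B)]           - a pair RDD whose keys are the inner A (._2._1)
// rdd.join(rdd.values) : RDD[(A, ((A, B), B))] - one row per match between an outer key and an inner A
// .keys                : RDD[A]                - one original key per match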

Details:

It's hard to understand the exact input and expected output, as the data structure (type) of neither is explicitly stated, and the requirement is not well explained by the code example. So I'll make some assumptions and hope that it will help with your specific case.

For the full example, I'll assume:

  • Input RDD has type RDD[(Int, (Int, Int))]
  • Expected output has the form RDD[Int], and would contain a lot of duplicates - if the original RDD has the "key" X multiple times, each match (in ._2._1) would appear once per occurrence of X as a key

If that's the case we're trying to solve, this join would do it:

// Some sample data, assuming all ints
val rdd = sc.parallelize(Seq(
  (1, (1, 5)),
  (1, (2, 5)),
  (2, (1, 5)),
  (3, (4, 5))
))

// joining the original RDD with an RDD of its "values" -
// rdd.values is a pair RDD whose keys are the original ._2._1 values,
// so the join matches each original key against those ._2._1 values;
// then we keep only the keys, one per match
val result: RDD[Int] = rdd.join(rdd.values).keys

// result contains the original keys, one occurrence per matching ._2._1
println(result.collect.toList) // List(1, 1, 1, 1, 2)
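
If you do still want the per-record list from the question - the (f._1, list, f._2._2) shape - here is a minimal sketch under the same assumptions. It groups the ._2._1 values by key with groupByKey and attaches the resulting list to each original record. The name firstsByKey is made up for illustration, and note that groupByKey materializes all matches for a key on a single executor:

// A sketch, not part of the answer above:
// collect all ._2._1 values per key ...
val firstsByKey = rdd.mapValues(_._1).groupByKey()   // RDD[(Int, Iterable[Int])]

// ... then attach that list to each original record
val aggregated = rdd.join(firstsByKey).map {
  case (k, ((_, second), matches)) => (k, matches.toList, second)
}

println(aggregated.collect.toList)
// e.g. List((1,List(1, 2),5), (1,List(1, 2),5), (2,List(1),5), (3,List(4),5))

This keeps the data distributed, at the cost of duplicating each key's list across all records that share that key.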
