Perform a nested for loop with RDD.map() in Scala
I'm rather new to Spark and Scala and have a Java background. I have done some programming in Haskell, so I'm not completely new to functional programming.

I'm trying to accomplish some form of a nested for loop. I have an RDD which I want to manipulate based on every two elements in the RDD. The pseudo code (Java-like) would look like this:
// some RDD named rdd is available before this
List<Integer> list = new ArrayList<>();
for (int i = 0; i < rdd.length; i++) {
    list.add(rdd.get(i)._1);
    for (int j = 0; j < rdd.length; j++) {
        if (rdd.get(i)._1 == rdd.get(j)._1) {
            list.add(rdd.get(j)._1);
        }
    }
}
// Then let ._1 of the rdd be this list
My Scala solution (which does not work) looks like this:
val aggregatedTransactions = joinedTransactions.map(f => {
  var list = List[Any](f._2._1)
  val filtered = joinedTransactions.filter(t => f._1 == t._1)
  for (i <- filtered) {
    list ::= i._2._1
  }
  (f._1, list, f._2._2)
})
I'm trying to put item ._2._1 into a list if ._1 of both items is equal. I am aware that I cannot use a filter or a map function within another map function. I've read that you could achieve something like this with a join, but I don't see how I could actually get these items into a list or any structure that can be used as a list.

How do you achieve an effect like this with RDDs?
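To pin down the intended semantics, here is a plain-Scala sketch (no Spark; Int types and the object/method names are assumed only for illustration) of what the attempt above tries to compute on a local Seq:

```scala
// A local (no Spark) sketch of the broken map-within-map above:
// for every entry, collect ._2._1 of all entries that share its key.
object NestedLoopSketch {
  def collectMatches(data: Seq[(Int, (Int, Int))]): Seq[(Int, List[Int], Int)] =
    data.map { case (k, (v1, v2)) =>
      // every ._2._1 whose entry has the same key, including this entry's own
      val matches = data.collect { case (k2, (m, _)) if k2 == k => m }.toList
      (k, v1 :: matches, v2)
    }

  def main(args: Array[String]): Unit = {
    val sample = Seq((1, (1, 5)), (1, (2, 5)), (2, (3, 6)))
    collectMatches(sample).foreach(println)
  }
}
```

Locally this is trivial; the question is how to express the same per-key matching when `data` is a distributed RDD and nested RDD operations are not allowed.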
Assuming your input has the form RDD[(A, (A, B))] for some types A, B, and that the expected result should have the form RDD[A] - not a List (because we want to keep data distributed) - this would seem to do what you need:
rdd.join(rdd.values).keys
Details:
It's hard to understand the exact input and expected output, as the data structure (type) of neither is explicitly stated, and the requirement is not well explained by the code example. So I'll make some assumptions and hope that they will help with your specific case.
For the full example, I'll assume:

- The input RDD has type RDD[(Int, (Int, Int))]
- The expected result has type RDD[Int], and would contain a lot of duplicates - if the original RDD has the "key" X multiple times, each match (in ._2._1) would appear once per occurrence of X as a key

If that's the case we're trying to solve - this join would solve it:
// Some sample data, assuming all ints
val rdd = sc.parallelize(Seq(
(1, (1, 5)),
(1, (2, 5)),
(2, (1, 5)),
(3, (4, 5))
))
// joining the original RDD with an RDD of the "values" -
// so the joined RDD will have "._2._1" as key
// then we get the keys only, because they equal the values anyway
val result: RDD[Int] = rdd.join(rdd.values).keys
// result is a flat RDD of the original keys, one occurrence per match
println(result.collect.toList) // List(1, 1, 1, 1, 2)
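For reference, the join's semantics can be reproduced locally without Spark - a sketch under the same Int-typed assumptions (the object/method names are made up), mirroring join as a nested comprehension over entries with equal keys:

```scala
// Local (no Spark) sketch of rdd.join(rdd.values).keys: the left side is keyed
// by ._1, the right side by ._2._1, and a key is emitted once per matching pair.
object JoinSketch {
  def joinKeys(data: Seq[(Int, (Int, Int))]): Seq[Int] =
    for {
      (k, _)  <- data             // left side: the original pairs, keyed by ._1
      (rk, _) <- data.map(_._2)   // right side: the "values", keyed by ._2._1
      if k == rk                  // the join condition
    } yield k

  def main(args: Array[String]): Unit = {
    val sample = Seq((1, (1, 5)), (1, (2, 5)), (2, (1, 5)), (3, (4, 5)))
    println(joinKeys(sample).sorted) // List(1, 1, 1, 1, 2), like the Spark result
  }
}
```

Note that a Spark join, like this sketch, produces one row per matching pair - which is where the four 1s come from: two left entries with key 1 each match two right entries keyed by 1.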